Abstract

A simple yet remarkably powerful tool of selfish and 
malicious participants in a distributed system is “equiv- 
ocation”: making conflicting statements to others. We 
present TrInc, a small, trusted component that combats 
equivocation in large, distributed systems. Consisting 
fundamentally of only a non-decreasing counter and a 
key, TrInc provides a new primitive: unique, once-in-a- 
lifetime attestations. 

We show that TrInc is practical, versatile, and easily 
applicable to a wide range of distributed systems. Its 
deployment is viable because it is simple and because 
its fundamental components—a trusted counter and a 
key—are already deployed in many new personal com- 
puters today. We demonstrate TrInc’s versatility with 
three detailed case studies: attested append-only mem- 
ory (A2M), PeerReview, and BitTorrent. 

We have implemented TrInc and our three case stud- 
ies using real, currently available trusted hardware. 
Our evaluation shows that TrInc eliminates most of 
the trusted storage needed to implement A2M, signifi- 
cantly reduces communication overhead in PeerReview, 
and solves an open incentives issue in BitTorrent. Mi- 
crobenchmarks of our TrInc implementation indicate di- 
rections for the design of future trusted hardware. 


1 Introduction 

As wide-area systems grow in scale, so does their exposure
to threats. Much of the interesting distributed-systems
research of the past decade has focused on the
issues of security and adversarial incentive that are inher- 
ent to large-scale systems. This research has addressed a 
wide range of applications, including storage [2, 16, 19, 
22, 28], communication [4, 45, 30], databases [40], con- 
tent distribution [15, 24, 32, 36], grid computation [12], 
and games [3, 10], in addition to generic infrastruc- 
ture [1, 5, 9, 18, 23, 43]. Virtually all of this work shares 
a common supposition, namely that the individual com- 
ponents in the system are completely untrusted. 

Recently, the necessity of this supposition has been 
called into question. The Attested Append-only Mem- 
ory (A2M) system by Chun et al. [7] showed that a small 
trusted module in each distributed component can signif- 
icantly improve system security. In addition to found- 
ing this important new research direction, A2M made 
two key contributions: First, its authors proposed a particular
abstraction for such a module, namely a trusted log.
Second, they showed specifically that their proposed
abstraction could improve the degree of fault tolerance
to Byzantine faults in the server components of
client-server systems.


Despite our appreciation for this work, we are con- 
cerned that distributed-protocol designers may be reluc- 
tant to start assuming the availability of such trusted 
modules. We have two reasons for this concern: First, 
the abstraction of a trusted log may require more stor- 
age space and complexity than researchers are comfort- 
able assuming, particularly for an embedded module in- 
side a potentially hostile component. Second, designers 
may have difficulty appreciating how broadly applicable 
a trusted module can be to distributed protocols. 


In this paper, we continue the research direction begun 
by A2M, with an eye toward addressing these two issues. 
First, we have developed a significantly smaller abstraction:
Instead of a trusted log, we propose a trusted incrementer
(TrInc), which is little more than a monotonic
counter and a key. Second, we demonstrate a more inclusive
set of architectures, running a broader range of protocols,
yielding a wider set of benefits: Our architectures
include not only client-server systems but also peer-to-peer
systems. Our protocols include not only Byzantine-fault-tolerant
protocols but also PeerReview [13] and BitTorrent [8].
Our demonstrated benefits include not only
improving fault tolerance but also reducing communication
overhead and solving an open incentive problem.


We show that TrInc has several benefits over A2M. 
First, its smaller size and simpler semantics make it 
easier to deploy, as we demonstrate by implementing 
it on real, currently available trusted hardware. Sec- 
ond, we observe that TrInc’s core functional elements 
are included in the Trusted Platform Module (TPM) [38] 
found on many modern PCs, lending credence to the 
idea that such a component could become widespread. 
Third, TrInc makes use of a shared symmetric session 
key among all participants in a protocol instance, which 
significantly decreases the cryptographic overhead. 


The rest of this paper is structured as follows. §2 provides
background on the underlying problem addressed
by TrInc (and by A2M), as well as a primer on trusted
hardware. §3 then presents the design of TrInc, and §4
analyzes its security. §§5, 6, and 7 respectively describe
several protocols we modified to use TrInc, our trusted
hardware implementation, and our evaluation thereof.
                                          Accountability layer          | Trusted module
Property                                  PeerReview [13] | Nysiad [14] | A2M [7] | TrInc
No centralized trust                                      |             |         |
Easy to deploy                                            |             |         |
Easy to apply to existing protocols                       |             |         |
Immediate consistency                                     |             |         |
No assumptions about protocol’s determinism               |             |         |
No additional online assumptions                          |             |         |
Additional communication overhead per     O(W²)           | O(W²)       | O(1)    | O(1)
protocol message, with witness sets
of size W

Table 1: Summary of the properties of various equivocation-fighting systems. *While PeerReview and Nysiad do not
require centralized trust, they do make use of a PKI. †Nysiad deals with nondeterminism by treating nondeterministic
events as inputs; this requires protocol changes for nondeterministic state machines. ‡We found that, although TrInc
requires a protocol redesign, the modifications are often localized, and vastly simplify security procedures.


2 Background and Related Work 


2.1 Equivocation in distributed systems 


Since 1982, it has been known that tolerating f Byzantine
faults requires at least 3f + 1 participants [20]. This
stands in marked contrast to the case for f stopping
faults, which more intuitively requires 2f + 1 participants.
A key insight behind A2M [7] was the observation
that a single property of Byzantine faults is responsible
for the difference between these two bounds. That property
is equivocation, meaning the ability to make conflicting
statements to different participants. A2M provides
a mechanism that prevents participants from equivocating,
thereby improving the fault tolerance of Byzantine
protocols to f out of 2f + 1.
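To make the two bounds concrete, the following Python sketch simply evaluates 3f + 1 and 2f + 1 in both directions. The function names and the boolean flag are illustrative choices of ours, not part of any protocol discussed here.

```python
def min_participants(f: int, equivocation_possible: bool) -> int:
    """Smallest group size that tolerates f faulty participants."""
    if f < 0:
        raise ValueError("f must be non-negative")
    # 3f + 1 when faulty participants can equivocate, 2f + 1 when they cannot.
    return 3 * f + 1 if equivocation_possible else 2 * f + 1


def max_faults(n: int, equivocation_possible: bool) -> int:
    """Largest f tolerated by a group of n participants (inverse of the above)."""
    divisor = 3 if equivocation_possible else 2
    return max((n - 1) // divisor, 0)
```

For example, a group of seven participants tolerates two faults under the 3f + 1 bound and three faults under the 2f + 1 bound.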


We make the further observation that equivocation is a 
necessary property for many forms of cheating and fraud, 
not merely for classical Byzantine faults. For instance, 
in BitTorrent, recent work [21] has shown an exploit in 
which a peer can obtain an unfairly high download rate 
by lying about which chunks of a file it has received. 
This is equivocation, insofar as the peer acknowledges 
receiving a chunk from the peer that provided it, but then 
tells another peer that it does not have the chunk. 


The following are three more brief examples: 


• In a simultaneous-turn game, one can cheat by observing
  an opponent’s move before making one’s
  own move; this is equivocating about whether one
  has yet moved.

• In a distributed electronic currency system, one can
  counterfeit money by equivocating to different payees
  about whether one has spent a particular bill.

• In an election, the tallier can disrupt the vote by
  equivocating to a voter and an official about whether
  the voter’s vote was recorded.


In §5.5, we will consider many other cases of malicious
behavior that can be interpreted as equivocation.
2.2 Prior solutions to equivocation


Several recent efforts have addressed the problem of 
Byzantine faults in distributed systems. Although their 
approaches to the problem are very different, they have 
all effectively focused on the issue of equivocation. Ta- 
ble 1 summarizes our analysis of their properties. 


PeerReview [13] is a system that employs witnesses to 
collect a tamper-evident record of all messages in a dis- 
tributed system for subsequent checking against a refer- 
ence implementation. Unlike the remaining approaches 
we will discuss, PeerReview does not provide fault toler- 
ance. Instead, it provides eventual fault detection and 
localization, which the system’s designers argue leads 
to fault deterrence. The tamper-evident record is a dis- 
tributed collection of logs that are authenticated using 
hash chains. The purpose of the tamper-evidence is to 
detect equivocation about the messages recorded in a 
log. As shown in Table 1, the communication required 
to collectively manage the tamper-evident message log 
is quadratic in the size of the witness set. 


Nysiad [14] is a mechanism that transforms crash- 
tolerant distributed systems into Byzantine-fault-tolerant 
ones. It does this by assigning a set of guards (compara- 
ble to witnesses) to each host in the system. The guards 
validate the messages sent by their associated hosts, us- 
ing replicas of the hosts’ execution engines. The po- 
tential for equivocation in Nysiad is that the host might 
send different messages to different guards or order its 
messages differently for different guards. To deal with 
this equivocation, the guards gossip among each other to 
agree on the order and content of messages sent by the 
host. As shown in Table 1, this gossip requires a count 
of messages that is quadratic in the number of guards. 
Relative to PeerReview, Nysiad has the benefit of imme- 
diate consistency, rather than eventual detection. Nysiad 
is also able to handle nondeterministic state machines, 
but doing so requires protocol changes to treat nondeter- 
ministic events as inputs. 


Attested Append-only Memory, or A2M [7], is a 
trusted module that is embedded in an untrusted machine,
for the purpose of improving the fault tolerance of
a distributed protocol. The A2M module provides the ab- 
straction of a trusted log, which the machine can append 
to but not otherwise modify. This limitation prevents the 
machine from equivocating about whether it performed 
a particular action at a particular step, because once the 
action is recorded in the log, it cannot be overwritten. 
A2M uses cryptography to enforce its properties and to 
attest the log’s contents to other machines. Relative to 
Nysiad and PeerReview, A2M does not require any addi- 
tional online communication between machines beyond 
what is required in the base protocol. Consequently, the 
communication overhead is merely a constant factor due 
to the cryptographic attestations that accompany the pro- 
tocol’s messages. 

As we will show in §3, TrInc is significantly smaller
than A2M, making it easier to deploy. TrInc also has 
another advantage, namely that its use is less tightly 
coupled to the distributed protocol than use of A2M is. 
Specifically, because A2M’s trusted log has finite stor- 
age, it provides a log-truncation operation, but opportu- 
nities to truncate the log may be limited by the protocol. 
Conversely, message sequencing in the protocol may be 
constrained by the available space in A2M’s log. Perhaps 
in part to address this concern, A2M considered various 
implementations in addition to hardware, some of which 
would likely have plentiful storage for the log. These include
a remote service, a software-isolated process, and
a memory-isolated virtual machine. By contrast, the pro- 
tocol modifications required to use TrInc tend to be quite 
localized. Furthermore, TrInc’s use of a shared session 
key often simplifies the protocol. 


2.3 Trusted hardware 


There have been many trusted hardware designs that 
predate both TrInc and A2M. Perhaps most similar 
to TrInc is the abstraction of virtual monotonic coun- 
ters [34]. These are similar to the four increment- 
only counters included in the current specification of 
the TPM [38]. Van Dijk et al. propose an algorithm
by which to emulate multiple counters with a single 
trusted counter [39]. We believe a similar approach 
could ease TrInc’s deployment by requiring fewer physi- 
cal counters. Further, other systems have been proposed 
that make use of trusted hardware, such as for securing 
database systems [26] and auctions [31]. To the best of 
our knowledge, TrInc is the first trusted component de- 
signed to be used in large-scale, distributed systems. 


3 TrInc Design
3.1 Design Goals 


The fundamental security goal of TrInc is to remove 
participants’ ability to equivocate. Consider the situation 
in which Mallory wishes to send conflicting messages 
to Alice and Bob. Common approaches to combating
such equivocation require Alice and Bob to communicate
with one another [13, 14, 20] or with a third party,
so they can learn of the distinct messages sent to each. 
Unfortunately, this additional communication overhead 
can become a bottleneck for the overlying system, and 
constitutes the super-linear number of messages in Peer- 
Review [13]. 

One goal of TrInc is therefore to minimize both communication
overhead and the number of non-faulty participants
required. With trusted hardware, it is possible to
remove Mallory’s ability to equivocate without any com- 
munication between Alice and Bob [7]. 

The other broad goal of TrInc is to be practical for dis- 
tributed systems today. To be practical, a trusted com- 
ponent must be small so that it is feasible to manufacture 
and deploy. Arbitrary computation and a large amount of 
storage are difficult and costly to make tamper-resistant. 
Further, to be a practical primitive in distributed systems, 
the trusted component must have an API with which it is 
easy to build distributed systems. 


3.2 Overview 


To gain the benefits of TrInc, a user must attach a 
trusted piece of hardware we call a trinket to his com- 
puter. Unlike a typical TPM, which must attest to states 
of the associated computer, the trinket’s API depends 
only on its internal state, so the trinket does not need 
access to the state of the computer. All it needs is an un- 
trusted channel over which it can receive input and pro- 
duce output, so even USB is quite sufficient. 

When Mallory wishes to send a message m to Alice,
she must include an attestation from her trinket that
(1) binds m to a certain value of a counter, and (2) en- 
sures Alice that no other message will ever be bound to 
that value of that counter, even messages sent to other 
users. A trinket enables such attestation by using a 
counter that monotonically increases with each new at- 
testation. In this way, once Mallory has bound a message 
m to a certain counter value c, she will never be able to 
bind a different message m’ to that value. 

As we show in our case studies in §5, some protocols
benefit from using multiple counters. In theory, any- 
thing done with multiple counters can be done with a 
single counter, but multiple counters allow certain per- 
formance optimizations and simplifications, such as as- 
signing semantic meaning to a particular counter value. 
Furthermore, the user of a trinket may participate in mul- 
tiple protocols, each requiring its own counter or coun- 
ters. Therefore, a trinket provides the ability to allo- 
cate new counters. However, we must identify each of 
them uniquely so that a malicious user cannot create a 
new counter with the same identity as an old counter 
and thereby attest to a different message with the same 
counter identity and value. 

As a performance optimization, TrInc allows its attes- 
tations to be signed with shared symmetric keys, which 
vastly improves its performance over using asymmetric
cryptography or even secure hashes. To ensure that par- 
ticipants cannot generate arbitrary attestations, the sym- 
metric key is stored in trusted memory, so that users can- 
not read it directly. Symmetric keys are shared among 
trinkets using a mechanism that ensures they will not be 
exposed to untrusted parties. 


3.3 Notation


We use the notation ⟨x⟩_K to mean an attestation of x
that could only be produced by an entity knowing K. If
K is a symmetric key, then this attestation can be verified
only by entities that know K; if K is a private key, then
this attestation can be verified by anyone, or more accurately
anyone who knows the corresponding public key.
We use the notation {x}_K to mean the value x encrypted
with public key K, so that it can only be decrypted by
entities knowing the corresponding private key.


3.4 TrInc state 


Figure 1 describes the full internal state of a trinket,
which we describe in more detail here. Each trinket is
endowed by its manufacturer with a unique identity I and
a public/private key pair (K_pub, K_priv). Typically, I will
be the hash of K_pub. The manufacturer also includes in
the trinket an attestation A that proves the values I and
K_pub belong to a valid trusted trinket, and therefore that
the corresponding private key is unknown to untrusted
parties.

We leave open the question of what form A will take. 
This attestation is meant to be evaluated by users, not by 
trinkets, and so can be of various forms. For instance, 
it might be a certificate chain leading to a well-known 
authority trusted to oversee trinket production and ensure 
their secrets are well kept. 

Another element of the trinket’s state is the meta-counter
M. Whenever the trinket creates a new counter,
it increments M and gives the new counter identity M.
This allows users to create new counters at will, without
sacrificing the monotonicity of any particular
counter. Because M only goes up, once a counter has
been created it can never be recreated by a malicious user
attempting to reset it.

Yet another element is Q, a limited-size FIFO queue 
containing the most recent few counter attestations gen- 
erated by the trinket. It is useful for allowing users to 
recover from power failures, as we will describe later. 

The final part of a trinket’s state is an array of counters,
not all of which have to be in use at a time. For each in-use
counter, the state includes the counter’s identity i, its
current value c, and its associated key K. The identity
i is, as described before, the value of the meta-counter
when the counter was created. The value c is initialized
to 0 at creation time and cannot go down. The key K
contains a symmetric key to use for attestations of this
counter; if K = 0, attestations will use the private key
K_priv instead.
Global state:

K_priv   Unique private key of this trinket
K_pub    Public key corresponding to K_priv
M        Meta-counter: the number of counters
         this trinket has created so far
Q        Limited-size FIFO queue containing the
         most recent few counter attestations
         generated by this trinket

Per-counter state:

i        Identity of this counter, i.e., the value of
         M when it was created
c        Current value of the counter (starts at 0,
         monotonically non-decreasing)
K        Key to use for attestations, or 0 if K_priv
         should be used instead

Figure 1: State of a trinket


3.5 TrInc API

Figure 2 shows the full API of a trinket, described in 
more detail in this subsection. 
3.5.1 Generating attestations 

The core of TrInc’s API is Attest. Attest takes
three parameters: i, c’, and h. Here, i is the identity of
a counter to use, c’ is the requested new value for that
counter, and h is a hash of the message m to which the
user wishes to bind the counter value. Attest works as
follows:








Algorithm 1 Attest(i, c’, h)
1. Assert that i is the identity of a valid counter.
2. Let c be the value of that counter, and K be the key.
3. Assert no roll-over: c ≤ c’.
4. If K ≠ 0, then let a ← ⟨COUNTER, I, i, c, c’, h⟩_K; otherwise
   let a ← ⟨COUNTER, I, i, c, c’, h⟩_{K_priv}.
5. Insert a into Q, kicking out oldest value.
6. Update c ← c’.
7. Return a.


Note that Attest allows calls with c’ = c. This is 
crucial to allowing peers to attest to what their current 
counter value is without incrementing it. To allow for 
this while still keeping peers from equivocating, TrInc 
includes both the prior counter value and the new one. 
One can easily differentiate attestations intended to learn 
a trinket’s current counter value (c = c’) from attesta- 
tions that bind new messages (c < c’). 
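The steps of Algorithm 1 can be sketched in Python as follows. This is a minimal model under stated assumptions, not the trinket firmware: the class and field names are ours, the queue length of 10 is arbitrary, and an HMAC over a fallback key stands in for the K_priv signature that a real trinket would produce when K = 0.

```python
import hashlib
import hmac
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Counter:
    value: int = 0      # c, starts at 0 and never decreases
    key: bytes = b""    # K; empty bytes play the role of K = 0


@dataclass
class Trinket:
    identity: bytes      # I
    fallback_key: bytes  # HMAC stand-in for K_priv (an assumption of this sketch)
    counters: dict = field(default_factory=dict)                     # i -> Counter
    recent: deque = field(default_factory=lambda: deque(maxlen=10))  # Q

    def attest(self, i: int, new_value: int, h: bytes) -> dict:
        # Steps 1-2: look up the counter, its value c, and its key K.
        counter = self.counters.get(i)
        if counter is None:
            raise KeyError(f"no counter with identity {i}")
        old_value = counter.value
        # Step 3: the counter may stay put or advance, but never go back.
        if new_value < old_value:
            raise ValueError("requested value is below the current value")
        # Step 4: authenticate (COUNTER, I, i, c, c', h).
        key = counter.key or self.fallback_key
        body = repr(("COUNTER", self.identity, i, old_value, new_value, h)).encode()
        tag = hmac.new(key, body, hashlib.sha256).digest()
        attestation = {"I": self.identity, "i": i, "c": old_value,
                       "c_new": new_value, "h": h, "tag": tag}
        # Steps 5-7: record in Q first, then advance the counter.
        self.recent.append(attestation)
        counter.value = new_value
        return attestation


trinket = Trinket(identity=b"trinket-A", fallback_key=b"demo-key")
trinket.counters[1] = Counter()
first = trinket.attest(1, 5, hashlib.sha256(b"message m").digest())
```

After the call above, a second request for value 5 succeeds only as a status attestation with c = c’ = 5, and any request for a value below 5 is rejected, so no other message can be bound to the interval already covered.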

3.5.2 Verifying attestations 

Suppose a user Alice with trinket A wants to send a
message to user Bob with trinket B. She first invokes
Attest(i, c’, h)
    Verifies that i is a valid counter with some value c and key K. Verifies
    that c ≤ c’. Creates an attestation a = ⟨COUNTER, I, i, c, c’, h⟩_K; if
    K = 0, uses K_priv instead of K. Adds a to Q. Sets c = c’. Returns a.

GetCertificate()
    Returns a certificate of this trinket’s validity: (I, K_pub, A).

CheckAttestation(a, i)
    Returns a boolean indicating whether a is the output of invoking
    Attest on a trinket using the same symmetric key as the one associated
    with counter i.

CreateCounter()
    Increments M. Creates a new counter with i = M, c = 0, and K = 0.
    Returns i.

FreeCounter(i)
    If i is the identity of a valid counter, deletes that counter.

ImportSymmetricKey(S, i)
    Verifies that S is an encrypted symmetric key decryptable with K_priv.
    Decrypts it and installs the included key as K for counter i.

GetRecentAttestations()
    Returns the contents of Q.

Figure 2: API of a trinket


Attest on her trinket using the message’s hash, and 
thereby obtains an attestation a. Next, she sends the mes- 
sage to Bob along with this attestation. However, for Bob 
to accept this message, he needs to be convinced that the 
attestation was created by a valid trinket. There are two 
cases to consider: first, that the attestation used A’s private
key K_priv; and second, that the attestation used a
shared symmetric key K.

In the first case, the API call GetCertificate will
be useful. This call returns a certificate C of the form
(I, K_pub, A), where I is the trinket’s identity, K_pub is
its public key, and A is an attestation that I and K_pub
belong to a valid trinket. Alice can call this API routine
and send the resulting certificate C_A to Bob. Bob can
then (a) learn Alice’s public key K_pub and (b) verify
that this is a valid trinket’s public key. After this, he can
verify the attestation Alice attached to her message, and
any future attestations she attaches to messages.


In the second case, the API call
CheckAttestation is useful. When
CheckAttestation(a, i) is invoked on a trinket,
the trinket checks whether a is the output of
invoking Attest on a trinket using the same symmetric
key as the one associated with the local counter i. It
returns a boolean indicating whether this is so. So, if
Alice sends Bob an attestation signed with a shared
symmetric key, Bob can invoke CheckAttestation
on his trinket to learn whether the attestation is valid.
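The symmetric-key check can be illustrated with a short Python sketch. It assumes, purely for illustration, that an attestation is an HMAC-SHA256 tag over the tuple (COUNTER, I, i, c, c’, h); the function names and the dictionary layout are ours. In a real trinket the key never leaves trusted memory, so the comparison below would run inside Bob's trinket, with `local_key` being the key stored for his counter i.

```python
import hashlib
import hmac


def _encode(identity: bytes, i: int, c: int, c_new: int, h: bytes) -> bytes:
    # Canonical byte encoding of the attested tuple (a choice made for this sketch).
    return repr(("COUNTER", identity, i, c, c_new, h)).encode()


def make_attestation(key: bytes, identity: bytes, i: int,
                     c: int, c_new: int, h: bytes) -> dict:
    """What the sender's trinket would emit under the shared session key."""
    tag = hmac.new(key, _encode(identity, i, c, c_new, h), hashlib.sha256).digest()
    return {"I": identity, "i": i, "c": c, "c_new": c_new, "h": h, "tag": tag}


def check_attestation(local_key: bytes, a: dict) -> bool:
    """Return True if `a` was produced under the same symmetric key."""
    if not local_key:          # no session key installed for the local counter
        return False
    if a["c"] > a["c_new"]:    # a well-formed attestation never goes backwards
        return False
    expected = hmac.new(
        local_key,
        _encode(a["I"], a["i"], a["c"], a["c_new"], a["h"]),
        hashlib.sha256,
    ).digest()
    # Constant-time comparison avoids leaking how many tag bytes matched.
    return hmac.compare_digest(expected, a["tag"])
```

Because anyone holding the symmetric key could also forge tags, the sketch only makes sense when the key is confined to trusted hardware, which is why the check is an API call on the trinket rather than user code.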


3.5.3 Allocating counters


Since a trinket may contain many counters, another
important component of TrInc’s API is the creation of
these counters. TrInc creates new logical counters, and
allows counters to be deleted, but never resets an existing
counter. Logical counters are identified by a
unique ID, generated using a non-deletable, monotonic
meta-counter M. Every trinket has precisely one meta-counter,
and when it expires, the trinket can no longer be
used; we compensate for this by making M 64 bits, only
incrementing M, and assigning no semantic meaning to
M’s value. TrInc exports a CreateCounter function
that increments M; allocates a new counter with identity
i = M, initial value 0, and initial key K = 0; and returns
this new identity i. When the user no longer needs
the counter, she may call FreeCounter to free it and
thereby provide space in the trinket for a new counter.
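A small Python sketch of this allocation scheme follows. The class name, the slot limit of four, and the exception types are assumptions made for illustration; only the behavior (increment M, hand out i = M, never reuse an identity) comes from the description above.

```python
class CounterTable:
    """Model of counter allocation driven by a monotonic meta-counter M."""

    MAX_META = 2**64 - 1  # M is described as a 64-bit value

    def __init__(self, slots: int = 4):
        self.meta = 0       # M: number of counters created so far
        self.slots = slots  # hypothetical limit on simultaneously live counters
        self.counters = {}  # identity i -> {"value": c, "key": K}

    def create_counter(self) -> int:
        if self.meta >= self.MAX_META:
            raise OverflowError("meta-counter exhausted; trinket is no longer usable")
        if len(self.counters) >= self.slots:
            raise RuntimeError("no free counter slots; call free_counter first")
        self.meta += 1  # M only ever goes up
        self.counters[self.meta] = {"value": 0, "key": None}  # c = 0, K = 0
        return self.meta

    def free_counter(self, i: int) -> None:
        # Deleting a counter releases its slot but never its identity.
        self.counters.pop(i, None)
```

Freeing counter 1 and creating a new counter yields identity 2, not 1, so an old attestation for counter 1 can never be confused with one from a freshly reset counter.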
3.5.4 Using symmetric keys 

TrInc allows its attestations to be signed with shared 
symmetric keys, which vastly improves its performance 
over using asymmetric cryptography or even secure 
hashes. When a set of users are willing to use a single 
symmetric key for a certain purpose, we call this a ses- 
sion. Creating a session requires a session administrator, 
a user trusted by all participants to create a session key 
and keep it safe, i.e., to not reveal it to any untrusted parties.

To create a session, the session administrator simply 
generates a random, fresh symmetric key as the session 
key K. To allow a certain user to join the session, he 
asks that user for his trinket’s certificate C. If the session 
administrator is satisfied that the certificate represents a 
valid trinket, he encrypts the key in a way that ensures 
it can only be decrypted by that trinket. Specifically, he 
creates {KEY, K}_{K_pub}, where K_pub is the public key in
C. He then sends this encrypted session key to the user 
who wants to join the session. 

Upon receipt of an encrypted session key, the user can 
join one of his counters to the session by using the API 
call ImportSymmetricKey(S, i). This call checks
that S is a valid encrypted symmetric key, meant to be
decrypted by the local private key. If so, it decrypts the
session key and installs it as K for local counter i. From
this point forward, attestations for this counter will use
the symmetric key. Also, the user will be able to verify
any trinket’s attestation a using this symmetric key by
invoking CheckAttestation(a, i).


3.5.5 Handling power failures 


One practical concern is that of power failure. Unlike 
A2M, TrInc users need not query the trusted hardware to 
obtain attestations. Instead, TrInc relies on the application
(or a TrInc driver) to store attestations in untrusted,
persistent storage. If there is a power failure between the 
time that the trinket advances its counter and the appli- 
cation writes it to disk, then the attestation is lost. This 
can be problematic for many protocols, which rely on 
the user being able to attest to a message with a particu- 
lar counter value. For instance, if Charlie cannot produce 
an attestation for counter value v, Alice may suspect this 
is because Charlie has already told Bob about some mes- 
sage m associated with that counter value. Not wanting 
to be fooled about the absence of such a message, Alice 
may lose all willingness to trust Charlie. 


To alleviate this, a trinket includes a queue Q containing
the most recent attestations it has created. To limit
the storage requirements, this queue only holds a certain
fixed number k of entries, perhaps 10. In the event of
a power failure, after recovery the user can invoke the
API call GetRecentAttestations to retrieve the
contents of Q. Thus, all a user must do to protect against
power failure is make sure she writes a needed attestation
to disk before she makes her kth next attestation request.
As long as k is at least 1, the user can safely use the trinket
for any application. Higher values of k are useful as
a performance optimization, allowing greater pipelining
between writing to disk and submitting attestations.


So far we have only discussed a power failure affect- 
ing the user, but a power failure can also affect the trin- 
ket. The Attest algorithm ensures that the attestation 
is inserted into the queue before the counter is updated, 
so the trinket cannot enter a situation where the counter 
has been updated but the attestation is unavailable. It 
can, however, enter the dangerous situation in which the 
attestation is in Q, and thus available to the user, but the 
counter has not been incremented. This window of vulnerability
could potentially be exploited by a user to generate
multiple attestations for the same counter value, if
he could arrange to shut off power at precisely this intervening
time. However, we guard against this case by having
the trinket check Q whenever it starts up. At startup,
before handling any requests, it checks all attestations in
Q and removes any that refer to counter values beyond
the current one.
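The queue-then-update ordering and the startup check can be modeled in a few lines of Python. This is a simulation under our own assumptions: attestations are plain dictionaries without signatures, the crash is a flag rather than a real power loss, and queued entries for counters that no longer exist are kept, a case the text does not address.

```python
from collections import deque


def attest_step(queue: deque, counters: dict, i: int, c_new: int,
                crash_before_update: bool = False) -> None:
    """Insert the attestation into Q, then advance the counter."""
    queue.append({"i": i, "c": counters[i], "c_new": c_new})
    if crash_before_update:
        return  # simulated power loss between the two writes
    counters[i] = c_new


def startup_check(queue: deque, counters: dict) -> None:
    """Remove queued attestations that refer to values beyond the current one."""
    kept = []
    for a in queue:
        current = counters.get(a["i"])
        if current is None or a["c_new"] <= current:
            kept.append(a)
    queue.clear()
    queue.extend(kept)  # the deque keeps its original maxlen


counters = {1: 0}
q = deque(maxlen=10)                                       # Q with k = 10
attest_step(q, counters, 1, 3)                             # completes normally
attest_step(q, counters, 1, 7, crash_before_update=True)   # interrupted
startup_check(q, counters)
```

After the startup check, Q holds only the attestation for the interval (0, 3], and the counter still reads 3, so the interrupted attestation for value 7 is never released to the user.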


3.5.6 A TrInc by any other name


The computational demands of a trinket are small. It 
must be able to do simple operations such as comparison, 
as well as cryptographic operations including hashing 
and both symmetric and asymmetric encryption and de- 
cryption. Such cryptographic operations are standard in 
trusted components such as the TPM [38]. However, we 
recognize that hardware manufacturers and users are of- 
ten highly cost-conscious and may be willing to do with- 
out performance optimization to save hardware costs. 


Therefore, we propose three versions of TrInc that 
make different trade-offs between cost and performance, 
               Persistent | Asym.  | Symm.  | Fast
               Memory     | Crypto | Crypto | Memory
Bronze TrInc   ✓          | ✓      |        |
Silver TrInc   ✓          | ✓      | ✓      |
Gold TrInc     ✓          | ✓      | ✓      | ✓


Table 2: Versions of TrInc with different performance.


summarized in Table 2. The bronze version simply of- 
fers correctness with no performance optimizations, by 
leaving out the ability to use symmetric keys. The silver 
version is as we have described it. The gold version adds 
one additional optimization: the use of fast persistent 
memory such as battery-backed RAM. This optimization 
makes attestations especially fast since they need not in- 
cur the cost of writing to the slow flash memory often 
found in modern TPMs. 


3.6 Local adversaries 


Mutually distrusting principals on a single computer 
will share access to a single trinket, creating the potential 
for conflict between them. Although they cannot equiv- 
ocate to remote parties, they can hurt each other. They 
can impersonate each other by using the same counter, 
and they can deny service to each other by exhausting 
shared resources within the trinket. Resource exhaustion 
attacks include allocating all available counters, submitting
requests at a high rate, and rapidly filling the queue
Q to prevent the pipelining performance optimization.

The operating system can solve this problem by me- 
diating access to the trinket, just as it mediates access to 
other devices. In this way, the OS can prevent a princi- 
pal from using counters allocated to other principals, and 
can use rate limiting and quotas to prevent resource ex- 
haustion. Developing a detailed API and policy for such 
mediation is beyond the scope of this paper, and is left for 
future work. However, note that a remote party need not 
care about how or whether such local mediation is done. 
Equivocation to remote parties is impossible, even if an 
adversary has root access to the machine, since cryptog- 
raphy allows the trinket to communicate securely even 
over an untrusted channel. 


4 Analysis of TrInc 


We now present a brief discussion of why TrInc is sufficient
for a broad class of distributed protocols and why
it is nearly minimal in size.


4.1 Equivocation 

When a trinket creates an attestation with distinct old
and new counter values of c and c’, we say that attestation
covers the half-open interval (c, c’]. TrInc prevents
equivocation by ensuring that no two attestations
will cover overlapping intervals. This property could be
violated only if:


• the counter is decremented,
• the cryptosystem is broken,
• more than one counter has the same identity, or
• more than one trinket has the same identifier.


By construction, it is not possible to decrement the 
counter nor to assign the same identity to multiple coun- 
ters. By hypothesis, cryptographic primitives are effec- 
tively unbreakable. Finally, no two trinkets will be cre- 
ated with the same identifier, at least not by a trusted 
manufacturer; recall that users can verify whether the 
trinket comes from a trusted manufacturer by observing 
the certificate chain in A. 
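
The interval argument above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Trinket` class, the `Attestation` record, and the `covers_overlap` helper are hypothetical names, and HMAC-SHA256 with a device-local key stands in for whatever signing primitive a real trinket uses.

```python
import hashlib
import hmac
from dataclasses import dataclass


@dataclass(frozen=True)
class Attestation:
    trinket_id: str
    old: int       # counter value before the call (c)
    new: int       # counter value after the call (c')
    digest: bytes  # hash of the attested message
    tag: bytes     # MAC over all of the fields above


class Trinket:
    """Toy stand-in for a trinket: one key and one non-decreasing counter."""

    def __init__(self, trinket_id: str, key: bytes):
        self.trinket_id = trinket_id
        self._key = key
        self._ctr = 0

    def attest(self, new: int, message: bytes) -> Attestation:
        # The only state check: the counter may never move backwards.
        if new < self._ctr:
            raise ValueError("counter may not decrease")
        old, self._ctr = self._ctr, new
        digest = hashlib.sha256(message).digest()
        payload = f"{self.trinket_id}|{old}|{new}|".encode() + digest
        tag = hmac.new(self._key, payload, hashlib.sha256).digest()
        return Attestation(self.trinket_id, old, new, digest, tag)


def covers_overlap(a: Attestation, b: Attestation) -> bool:
    """True if the half-open intervals (a.old, a.new] and (b.old, b.new] intersect."""
    return max(a.old, b.old) < min(a.new, b.new)


t = Trinket("trinket-0", b"demo-key")
first = t.attest(3, b"message A")   # covers (0, 3]
second = t.attest(5, b"message B")  # covers (3, 5]
```

Because each call starts its interval where the previous one ended, two attestations from the same counter can never cover a common value, and a request to reuse an earlier value is refused.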


4.2 Timeliness 


When a trinket creates an attestation with the same old 
and new counter values, there is no change to the trin- 
ket’s state; however, the attestation demonstrates the cur- 
rent value of the counter. Thus, if a machine attests to a 
value of a remotely supplied nonce, the remote machine 
can be certain that the attestation was generated after the 
nonce was supplied. Since this attestation carries the cur- 
rent counter value, the remote machine can thus also be 
sure that the local machine’s counter is no lower than this 
value. 

Therefore, when the local machine provides attesta- 
tions of counter values up to the nonce-attested value, 
the remote machine can be certain that these attestations 
are timely. 
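
As a rough illustration of the freshness check, the sketch below assumes a verifier that shares a symmetric key with the attesting device; the names `status_attest` and `is_fresh`, the key, and the byte layout are all assumptions made for the example, not part of TrInc's API.

```python
import hashlib
import hmac
import secrets
from typing import Tuple

KEY = b"shared-demo-key"  # assumption: the verifier can check tags with this key

StatusAttestation = Tuple[int, bytes, bytes]  # (counter, nonce, tag)


def status_attest(counter: int, nonce: bytes) -> StatusAttestation:
    """Attest to the current counter without advancing it (old == new)."""
    payload = counter.to_bytes(8, "big") + nonce
    return counter, nonce, hmac.new(KEY, payload, hashlib.sha256).digest()


def is_fresh(attestation: StatusAttestation, expected_nonce: bytes) -> bool:
    """Accept only a valid attestation that embeds the nonce we just issued."""
    counter, nonce, tag = attestation
    payload = counter.to_bytes(8, "big") + nonce
    expected_tag = hmac.new(KEY, payload, hashlib.sha256).digest()
    return hmac.compare_digest(tag, expected_tag) and hmac.compare_digest(
        nonce, expected_nonce
    )


nonce = secrets.token_bytes(16)                        # chosen by the remote machine
reply = status_attest(counter=7, nonce=nonce)          # generated after the challenge
stale = status_attest(counter=7, nonce=b"\x00" * 16)   # answers some other challenge
```

A reply that passes `is_fresh` must have been produced after the nonce was chosen, so the counter it reports is a lower bound on the device's current counter.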


4.3 Minimality


Suppose, during the execution of a protocol, a participant
sends n messages requiring attestation, but her attesting
module has fewer than log₂(n) bits of storage.
The attesting module must be willing to provide all n 
attestations, or else it will cause the protocol to halt pre- 
maturely. However, since the module can be in fewer 
than n distinct states, by the pigeonhole principle it must 
be willing to attest to two different messages while in the 
same state. Since this state is as it was before the first 
message, it cannot reflect the trinket’s having attested to 
the first message. This means a malicious user could take 
advantage of the trinket’s inability to remember its first 
attestation when requesting the second attestation, and 
thereby obtain an attestation inconsistent with the ear- 
lier one. This is clearly inconsistent with the goals of a 
trusted module, so we come to a contradiction, and conclude
that such a module requires at least log₂(n) bits
of storage. In other words, it needs sufficient storage to 
accommodate a message counter. 
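
To make the bound concrete, the following sketch computes the smallest number of bits whose state space can distinguish n messages. The helper name `min_counter_bits` is hypothetical and the sample values of n are arbitrary.

```python
def min_counter_bits(n: int) -> int:
    """Smallest b such that 2**b >= n, i.e. ceil(log2(n)) for n >= 1."""
    return (n - 1).bit_length()


# With fewer bits than this, the module has fewer than n distinct states,
# so two of the n attestations must be issued from the same state.
bits_for_two = min_counter_bits(2)          # 1
bits_for_thousand = min_counter_bits(1000)  # 10
bits_for_2_32 = min_counter_bits(2**32)     # 32
```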

Furthermore, an attesting module needs for its attesta- 
tions to be unforgeable. Otherwise, the user could gen- 
erate attestations without using the module, and thereby 
attest to both sides of an equivocation. TrInc achieves 
this unforgeability with simple cryptographic primitives. 

In summary, the core components of TrInc, a counter 
and cryptography, seem to be essential for equivocation 
prevention. 


5 Designing Systems with TrInc 


5.1 Overview 

When designing a protocol that incorporates TrInc, we 
find it important to address the following questions: 
5.1.1 What does TrInc’s counter represent? 

In the applications we have considered, TrInc’s 
counter represents a natural “progression” of the sys- 
tem. In BitTorrent, for instance, the counter represents 
the number of blocks a given peer has received, a value 
which is naturally monotonically increasing. In Byzan- 
tine Fault Tolerance (BFT), the counter represents which 
view a replica is in. Ultimately, the choice of what the 
counter represents is dependent on what data peers will 
need to attest to. 

5.1.2 To what data do peers attest? 

There are two broad types of attestations that TrInc of- 
fers. Advance attestations increase the trinket’s counter, 
thus binding a message to a counter. Status attestations 
attest to the current counter without advancing it. 


Advance attestations Advance attestations are largely 
protocol-dependent, including such elements as the set of 
pieces received in BitTorrent, or the root of a Merkle tree 
of file hashes in a file server. The specific data to which 
to attest often requires a careful analysis of the selfish 
or malicious ways in which peers could equivocate. It 
is important to ensure that the impossibility of equivo- 
cating about what was assigned to a particular counter 
value translates into the impossibility of equivocating at 
the higher desired semantic level. 

For instance, suppose an attestation consists solely of 
a number n of pieces received in BitTorrent and a list of 
n peers. In this case, a participant Mallory can cheat in 
the following way. After receiving the first piece a from 
Alice, she replies with an attestation that her one-piece 
set contains only a. Next, after receiving her next two 
pieces b from Bob and c from Charlie, she sends them
both an identical attestation that her two-piece set is b 
and c. In this way, Mallory gets away with hiding the 
fact that she has received piece a, despite not being able 
to get different attestations for the same value of n = 2. 
As we will see later, in §5.4, we prevent this by having
an attestation include the last piece received. 
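
Mallory's trick can be replayed with a toy model in which an attestation binds a counter value n to a set of n pieces and nothing else. The `attest` function and the `attested` table are illustrative assumptions, not the trinket interface.

```python
# counter value n -> the set of pieces attested at that value
attested = {}


def attest(n, pieces):
    """Toy rule: each counter value is bound to exactly one statement."""
    if n in attested:
        raise ValueError(f"counter value {n} already used")
    if attested and n < max(attested):
        raise ValueError("counter may not decrease")
    if len(pieces) != n:
        raise ValueError("statement must list exactly n pieces")
    attested[n] = frozenset(pieces)
    return attested[n]


to_alice = attest(1, {"a"})                  # after receiving a from Alice
to_bob_and_charlie = attest(2, {"b", "c"})   # after receiving b and c
hidden = to_alice - to_bob_and_charlie       # what the second statement omits
```

Neither call violates the counter rule, yet the statement shown to Bob and Charlie hides piece a. The counter alone does not enforce the intended semantics; the attested data must do that.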


Status attestations Most distributed systems do not 
have an implicit system-wide “counter.” Rather, peers 
progress at varying rates: BitTorrent peers download at 
rates largely dependent on their own upload rates, DHT 
peers store varying amounts of data, and so on. Sta- 
tus attestations enable peers to determine others’ current 
counter values. The data in a status attestation is gen- 
erally a nonce, to ensure freshness in peers’ reports of 
their counters. Coupled with a counter that has semantic 
meaning, status attestations can provide peers with up- 
to-date information about their neighbors. In BitTorrent, 
for instance, knowing how much of a file a neighbor has 
downloaded can help determine whether to bootstrap him
with free pieces (because he is new to the swarm) or to
initiate a trade with him (because he has many interesting
pieces of the file).


Algorithm 2 Implementation of A2M with TrInc

Init()
  1. Create low and high counters:
       L_q ← CreateCounter(); H_q ← CreateCounter()
  2. Return {L_q, H_q}

Append(queue q, value x)
  1. Bind h(x) to a unique counter (the current “high counter”):
       a ← Attest(H_q.id, H_q.ctr+1, h(x))
  2. Store the attestation in untrusted memory:
       q.append(a, x)

Lookup(queue q, sequence number n, nonce z)
  1. If n < L_q, the entry was truncated. Attest to this by returning an
     attestation of the supplied nonce using the low-counter:
       Attest(L_q.id, L_q.ctr, h(FORGOTTEN||z))
  2. If n > H_q, the query is too early. Attest to this by returning an
     attestation of the supplied nonce using the high-counter:
       Attest(H_q.id, H_q.ctr, h(TOOEARLY||z))
  3. Otherwise, return the entry in q that spans n, i.e., the one such that
     a.c < n ≤ a.c’. Note that if n < a.c’, this means n was skipped
     by an Advance.

End(queue q, sequence number n, nonce z)
  1. Retrieve the latest entry from the given log:
       {a, x} ← q.end()
  2. Attest that this is the latest entry with a high-counter
     attestation of the supplied nonce:
       a’ ← Attest(H_q.id, H_q.ctr, z)
  3. Return {a’, {a, x}}

Truncate(queue q, sequence number n)
  1. Remove the entries from untrusted memory:
       q.truncate(n)


5.2 Case study 1: A2M 


Attested Append-only Memory (A2M) [7] is another 
proposed trusted hardware design with the intent of com- 
bating equivocation. A2M offers trusted logs, to which 
users can only append. The fundamental difference between
the designs of A2M and TrInc is in the amount
of state and computation required from the trusted hard- 
ware. To demonstrate that TrInc’s decreased complex- 
ity is enough, we present, as our first case study, how to 
build A2M using TrInc. 


5.2.1 A2M overview 


A2M’s state consists of a set of logs, each containing
entries with monotonically increasing sequence numbers.
A2M supports operations to add (append and
advance), retrieve (lookup and end), and delete
(truncate) items from its logs. The basis of A2M’s resilience
to equivocation is append, which binds a message
to a unique sequence number. For each log q, A2M
stores the lowest sequence number, L_q, and the highest
sequence number, H_q, stored in q. A2M appends an entry
to log q by incrementing the sequence number H_q
and setting the new entry’s sequence number to be this
incremented value. The low and high sequence numbers
allow A2M to attest to failed lookups; for instance, if a
user requests an item with sequence number s > H_q,
A2M returns an attestation of H_q.


Algorithm 2 (continued)

Truncate(queue q, sequence number n), continued
  2. Move up the low counter:
       a ← Attest(L_q.id, n, FORGOTTEN)

Advance(queue q, sequence number n, value x)
  1. Append a new item with sequence number n:
       a ← Attest(H_q.id, n, h(x))
  2. Store the attestation in untrusted memory:
       q.append(a, x)


5.2.2 Trusted logs with TrInc 


In our TrInc-based design of A2M, we store logs in 
untrusted memory, as opposed to within a trinket. As in 
A2M, we make use of two counters per log, representing 
the highest (H_q) and lowest (L_q) sequence number in the
respective log q. 

We present the detailed protocol in Algorithm 2, and 
summarize some of its characteristics here. Note the 
power of TrInc’s simple API; our design is built predominantly
on calls to a trinket’s Attest function. Our protocol
uses advance attestations for moving the high sequence
number when appending to the log, and for moving
the low sequence number when deleting from the log.
We perform status attestations of the low counter value to 
attest to failed lookups, and of the high counter to attest 
to the end of the log. No additional attestations are necessary
for a successful lookup, even if the Lookup is
to a skipped entry. Conversely, A2M requires calls to the 
trusted hardware even for successful lookups. 
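
A minimal sketch of this design, assuming a toy multi-counter `Trinket` class in place of real hardware: the class names, the HMAC-based tag, and the in-memory list used as untrusted storage are all illustrative assumptions. The method bodies follow the steps of Algorithm 2.

```python
import hashlib
import hmac
from dataclasses import dataclass


@dataclass(frozen=True)
class Att:
    counter_id: int
    old: int
    new: int
    data: bytes
    tag: bytes


class Trinket:
    """Toy trinket with several non-decreasing counters (assumed interface)."""

    def __init__(self, key: bytes):
        self._key = key
        self._ctrs = []

    def create_counter(self) -> int:
        self._ctrs.append(0)
        return len(self._ctrs) - 1

    def ctr(self, cid: int) -> int:
        return self._ctrs[cid]

    def attest(self, cid: int, new: int, data: bytes) -> Att:
        old = self._ctrs[cid]
        if new < old:
            raise ValueError("counter may not decrease")
        self._ctrs[cid] = new
        payload = f"{cid}|{old}|{new}|".encode() + data
        tag = hmac.new(self._key, payload, hashlib.sha256).digest()
        return Att(cid, old, new, data, tag)


def h(*parts: bytes) -> bytes:
    return hashlib.sha256(b"||".join(parts)).digest()


class TrincLog:
    """One A2M-style log; the entries live in ordinary, untrusted memory."""

    def __init__(self, trinket: Trinket):
        self.t = trinket
        self.low = trinket.create_counter()
        self.high = trinket.create_counter()
        self.entries = []  # list of (Att, value) pairs

    def append(self, x: bytes) -> Att:
        a = self.t.attest(self.high, self.t.ctr(self.high) + 1, h(x))
        self.entries.append((a, x))
        return a

    def advance(self, n: int, x: bytes) -> Att:
        a = self.t.attest(self.high, n, h(x))
        self.entries.append((a, x))
        return a

    def truncate(self, n: int) -> Att:
        # Drop entries that end below n, then move up the low counter.
        self.entries = [(a, x) for a, x in self.entries if a.new >= n]
        return self.t.attest(self.low, n, b"FORGOTTEN")

    def lookup(self, n: int, z: bytes):
        if n < self.t.ctr(self.low):
            return self.t.attest(self.low, self.t.ctr(self.low), h(b"FORGOTTEN", z))
        if n > self.t.ctr(self.high):
            return self.t.attest(self.high, self.t.ctr(self.high), h(b"TOOEARLY", z))
        # Successful lookup: no trinket call, only a search of untrusted memory.
        return next(((a, x) for a, x in self.entries if a.old < n <= a.new), None)


log = TrincLog(Trinket(b"demo-key"))
log.append(b"first")             # sequence number 1
log.advance(4, b"fourth")        # skips sequence numbers 2 and 3
hit = log.lookup(3, b"nonce")    # spanned by the entry covering (1, 4]
late = log.lookup(9, b"nonce")   # beyond the high counter
log.truncate(2)
gone = log.lookup(1, b"nonce")   # below the low counter
```

In this sketch only `append`, `advance`, `truncate`, and the two failed-lookup branches touch the trinket; the successful lookup is a purely local operation.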


5.2.3 Properties of a TrInc-based A2M


Chun et al. [7] demonstrate how to apply A2M to 
BFT [20], SUNDR [22], and Q/U [1]. Our implemen- 
tation of A2M in TrInc demonstrates that TrInc, too, can 
be applied to these systems. 

Implementing trusted logs using TrInc has several 
benefits over a completely in-hardware design like A2M. 
Because TrInc stores the logs in untrusted storage, we 
decouple the usage demand of the trusted log from the 
amount of available trusted storage. Conversely, limited
by the amount of trusted storage, A2M must make
more frequent calls to truncate to keep the logs small.
Some systems, such as PeerReview [13], which we consider
next, benefit from large logs, making TrInc a more suitable
addition.


5.3 Case study 2: PeerReview


Accountability systems, such as PeerReview [13] and 
Nysiad [14], strive to augment existing protocols to make 
them tolerant to Byzantine faults. This is a powerful ap- 
proach, as it allows system designers to focus on the sys- 
tem at hand, rather than consider Byzantine faults at all 
layers of the system. The general approach is to have par- 
ticipants in the system communicate with and audit one 
another, resulting in what is sometimes, unfortunately, a 
massive amount of additional communication overhead. 

Our main observation in this case study is that the 
means by which these systems combat equivocation con- 
stitutes the bulk of their communication overhead. By 
applying TrInc to PeerReview, we are able to vastly re- 
duce PeerReview’s communication overhead. 

5.3.1 PeerReview review 

PeerReview [13] is a system that enables accountabil- 
ity in general distributed protocols. Unlike BFT, which 
ensures that bad behavior never has an effect, PeerRe- 
view allows bad behavior to affect the system but ensures 
that the improper act will eventually be detected. This al- 
lows a system to correct for bad behavior after the fact, 
and also deters bad behavior to begin with. 

PeerReview works on any protocol in which each par- 
ticipant acts according to a deterministic state machine. 
PeerReview assigns each participant a set of witnesses, 
machines whose job it is to detect bad behavior by that 
participant. The participant is required to log all of the 
messages it sends and receives, and report these to the 
witnesses. The witnesses then run the participant’s state 
machine to ensure the participant’s outgoing messages 
were consistent with proper operation. 

A participant might try to cheat by sending different 
messages to the witnesses than it sends to other partic- 
ipants. For this reason, when a participant receives a 
message from another, it forwards this message to the 
sender’s witnesses, so they can ensure this message actu- 
ally appears in the sender’s log. 

As a practical matter, full messages do not have to be 
transmitted to witnesses thanks to the use of a tamper- 
evident log. Each log entry is associated with a sequence 
number, and the log itself is represented by a recursive 
hash reflecting all log entries. When a participant sends 
a message, it includes a signed statement that this mes- 
sage has a particular sequence number and that the log 
had a particular recursive hash when this message was 
logged. In this way, the receiver only needs to report this 
authenticator to the witness. 
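
The recursive hash can be sketched as a simple hash chain. The exact fields PeerReview hashes are not given here, so the layout below (previous hash, sequence number, message hash) is an assumption, and the signature that turns a chain value into an authenticator is omitted.

```python
import hashlib


def extend(prev_hash: bytes, seq: int, message: bytes) -> bytes:
    """Recursive hash: each value commits to all earlier log entries."""
    body = prev_hash + seq.to_bytes(8, "big") + hashlib.sha256(message).digest()
    return hashlib.sha256(body).digest()


def chain(messages):
    """Return the running hash after each logged message."""
    top = b"\x00" * 32  # assumed initial value for an empty log
    hashes = []
    for seq, message in enumerate(messages, start=1):
        top = extend(top, seq, message)
        hashes.append(top)
    return hashes


honest = chain([b"m1", b"m2", b"m3"])
forked = chain([b"m1", b"XX", b"m3"])  # same log with entry 2 replaced
```

Changing one entry changes every later hash, which is why a receiver can report only the short authenticator rather than the full message.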

PeerReview’s tamper-evident log has another impor- 
tant use. When a participant or witness discovers bad 
behavior in a participant, the authenticators signed by
the malefactor stand as clear proof of the misbehavior.
Thus, a faulty witness cannot improperly accuse a par- 
ticipant, and an incompletely trusted witness can be be- 
lieved when it presents evidence of a participant’s mis- 
behavior. 
5.3.2 Simplifying PeerReview with TrInc 

By augmenting PeerReview with TrInc, we are able to 
simplify much of PeerReview’s protocol. We detail here 
the modifications we make to PeerReview in augmenting 
it with TrInc. 


Trusted logs As demonstrated with A2M, TrInc can 
easily supply a trusted log without the assistance of a 
witness set. Our first modification is to include such a 
trusted log. Whenever a participant sends or receives a 
message, it logs that message with an attestation from 
its trinket. A participant should only process a received 
message if it is accompanied by an attestation that the 
message has been logged by the sender’s trinket. 


Audits Each witness w for a participant p keeps track of 
n, a log sequence number, and s, the state that p should 
have been in after sending or receiving the message in 
log entry n. It initializes n to 0 and s to the initial state 
of participant p. 

Whenever w wants to audit p, it sends it n and a nonce. 
The participant returns an attestation of its current log entry
number n’ using the nonce, and also returns a log entry
and attestation for every index i such that n < i ≤ n’.
Note that witnesses need only obtain these entries di- 
rectly from p, and not from other peers with whom p has 
communicated. The witness then runs the reference im- 
plementation, starting at state s, and progressing through 
the log entries between n and n’. If the reference imple- 
mentation sends the same messages that are in the log, 
then the witness simply updates n to n’ and updates s 
to the state of the reference implementation at that point. 
If not, then the witness has proof it can present of the 
participant’s failure to act properly. 
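
The replay step of an audit might look like the sketch below. It assumes the reference implementation is available as a pure `step` function and that log entries are tagged as received or sent; these names and the toy `echo_step` machine are hypothetical, and verification of the trinket attestations on each entry is left out.

```python
from typing import Callable, List, Tuple

# A log entry is (direction, message); direction is "recv" or "send".
Entry = Tuple[str, str]
# step(state, received_message) -> (new_state, messages the machine sends)
Step = Callable[[int, str], Tuple[int, List[str]]]


def audit(step: Step, state: int, entries: List[Entry]) -> Tuple[bool, int]:
    """Replay logged inputs through the reference machine and compare outputs."""
    expected: List[str] = []
    for direction, msg in entries:
        if direction == "recv":
            state, out = step(state, msg)
            expected.extend(out)
        elif not expected or expected.pop(0) != msg:
            return False, state  # logged a send the reference machine would not make
    # Every message the reference machine produced must also appear in the log.
    return not expected, state


def echo_step(state: int, msg: str) -> Tuple[int, List[str]]:
    """Toy deterministic machine: acknowledge each message with a running count."""
    state += 1
    return state, [f"ack {state}:{msg}"]


good = [("recv", "hi"), ("send", "ack 1:hi"), ("recv", "yo"), ("send", "ack 2:yo")]
bad = [("recv", "hi"), ("send", "ack 7:hi")]

good_ok, new_state = audit(echo_step, 0, good)
bad_ok, _ = audit(echo_step, 0, bad)
```

On success the witness would keep `new_state` as s and advance n to n’; on failure the offending entries, together with their attestations, serve as the proof.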

5.3.3 Properties of a TrInc-enabled PeerReview

The benefits from applying TrInc to PeerReview are 
evident when considering what the protocol no longer 
has to do. 


Challenge/response Enabled with TrInc, PeerRe- 
view’s challenge/response protocol is no longer needed 
for a participant to verify a hash chain of log entries. The 
fact that TrInc signs the messages is sufficient. The only 
time a participant i has to challenge another participant j
is when it sends participant j a message and receives no
acknowledgment of it. In this case, the challenge works 
as in regular PeerReview. 


Consistency TrInc further removes the need for
witness-to-witness communication. In PeerReview, if p
receives an authenticator from q, then p’s witnesses must
forward it to q’s witnesses. This is not necessary in a
TrInc-augmented PeerReview because there would be no
way for those other participants to avoid sending the authenticators
themselves to their witnesses. Another way
to look at it is that it is not necessary for a participant
to pass on authenticators it receives to witnesses, so it is
not necessary for a witness to do this on behalf of participants.

To summarize, we find that by applying TrInc to Peer- 
Review, we are able to vastly decrease the amount of 
communication overhead. We demonstrate this empiri- 
cally in Section 7. 


5.4 Case study 3: BitTorrent 


The previous two systems demonstrate that TrInc is a 
minimal counterpart to a related trusted component, and 
that it can reduce the overhead of achieving accountabil- 
ity in a distributed setting. Our third case study demon- 
strates TrInc’s versatility. We show how TrInc can be 
applied to solving an open incentive problem [21] in the 
immensely popular BitTorrent system [8]. 

5.4.1 A brief overview of BitTorrent 

BitTorrent [8] is a decentralized file swarming system 
whose goal is to disseminate large files to a large num- 
ber of downloaders. Rather than rely on a highly pro- 
visioned server, BitTorrent peers trade small pieces of a 
file with one another, thereby contributing to the system 
while gaining from it. Bitfields represent which pieces of 
a file a peer has. Peers trade bitfields in order to gain one 
another’s interest; a peer is interested in peers who have 
pieces that it does not. Since peers only upload to peers 
in whom they are interested, peers have incentive to be 
as interesting to as many others as possible. 

5.4.2 Piece under-reporting 

BitTorrent peers can sometimes have incentive to 
under-report what pieces they have to their neighbors, 
since by doing so they can limit the degree to which their 
neighbors find interest in one another [21]. For instance, 
suppose peer i has neighbors j and k, both of whom want
pieces p and q from i. If i were to tell them both about
both pieces, one might demand p and the other might demand
q. After obtaining them, they might gain interest
in one another and exchange p and q among themselves,
thus decoupling from i. Thus, i may prefer to under-report
by sending to j and k a bitfield that contains p but
not q. As a result, both neighbors request and obtain p,
gaining no interest in one another; only then does i reveal
that he also has piece q, forcing j and k to download it
from i.

Such under-reporting leads to a tragedy of the com- 
mons, since although strategic under-reporters’ down- 
load times improve, the system as a whole suffers [21]. 
Since its recent discovery, strategic under-reporting has 
yet to be solved; we demonstrate how to solve it with 
TrInc. 

5.4.3 Solving under-reporting with TrInc 

We observe that under-reporting in file swarming sys- 
tems is an act of equivocation. Using the above example, 
when peer i received piece q from peer ℓ, i must have
sent an acknowledgment, stating to ℓ that he received the
piece. However, by under-reporting q to peers j and k,
i is effectively contradicting a statement he made earlier
to ℓ.


Algorithm 3 Fighting equivocation in BitTorrent

Upon receipt of piece p:
  1. Add p to bitfield B
  2. a_curr ← Attest(i, |B|, h(p, B))

Upon sending piece p to neighbor j:
  1. Request an attestation from j with a random nonce.
  2. Do not send any piece other than p to j until j admits
     to having p.

Periodically, for each neighbor j:
  1. Request an attestation of j’s current bitfield with a
     random nonce.

Upon receiving an attestation request with nonce z:
  1. a_tmp ← Attest(i, |B|, z).
  2. Reply with (a_curr, a_tmp).

Our goal is therefore to remove BitTorrent peers’ abil- 
ity to undetectably equivocate. We present in Algo- 
rithm 3 a TrInc-based protocol for fighting equivocation 
in BitTorrent. In this protocol, a peer attests to his bit- 
field, incrementing a trinket counter for each piece he 
receives. Also, peers periodically request up-to-date at- 
testations from their neighbors, to maintain fresh state. 

Because they join the swarm at different times and 
download at different rates, peers’ counters are not syn- 
chronized. In Algorithm 3, the TrInc counter does not 
correspond to some system-wide “round” the protocol is 
in, as it would in, say, BFT machine replication. Instead, 
peer i’s counter represents how many pieces i has downloaded.
This is a natural fit for the counter, because it is a
monotonically increasing number, and because the type 
of malicious behavior we want to prevent corresponds to 
pretending it is not monotonic. 

Algorithm 3 demonstrates the importance of choosing 
the correct data to which to attest. Suppose, for instance, 
peers were to attest only to their bitfields. Clearly, when 
s sends an attested bitfield to neighbor n, s must include 
the piece n sent him, p_n, in the bitfield, otherwise n will
observe an under-report. Were s to attest only to the bitfield,
then s could under-report as follows, where B_old
represents the bitfield before receiving pieces p_a, p_b, and
p_c, and ⊕ denotes adding a piece to the bitfield:

• To a: B_old ⊕ p_a

The problem arises because the data to which s is attesting
does not enforce monotonicity at the semantic level
we desire. Specifically, though the counter cannot decrease,
it does not have to correspond to the number of
distinct pieces acknowledged, allowing a malicious participant
to misstate the number of distinct pieces he has
acknowledged.

In our solution, a peer attests not only to the hash of 
his bitfield B, but also to the most recent piece he has
received, p. Neighbor n therefore expects an advance attestation
including both p_n and a bitfield containing p_n.
As a result, every piece must have a unique advance at- 
testation, ensuring that s’s counter must be as large as the 
number of pieces he has acknowledged receiving. 
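
The check a sending neighbor performs can be sketched as follows. The `Peer` class, the `Advance` record, and `sender_accepts` are hypothetical names; a plain integer stands in for the trinket counter, and the cryptographic tag on the attestation is omitted.

```python
import hashlib
from typing import FrozenSet, NamedTuple


class Advance(NamedTuple):
    old: int
    new: int
    digest: str  # h(p, B): the latest piece together with the bitfield


def h(piece: str, bitfield: FrozenSet[str]) -> str:
    return hashlib.sha256(repr((piece, sorted(bitfield))).encode()).hexdigest()


class Peer:
    def __init__(self) -> None:
        self.bitfield: FrozenSet[str] = frozenset()
        self.counter = 0  # toy stand-in for the trinket's counter

    def receive(self, piece: str) -> Advance:
        """Add the piece, advance the counter to |B|, and attest to h(p, B)."""
        self.bitfield = self.bitfield | {piece}
        old, self.counter = self.counter, len(self.bitfield)
        return Advance(old, self.counter, h(piece, self.bitfield))


def sender_accepts(sent_piece: str, att: Advance, claimed: FrozenSet[str]) -> bool:
    """Check run by the neighbor that has just sent `sent_piece`."""
    return (
        sent_piece in claimed
        and att.new == len(claimed)
        and att.digest == h(sent_piece, claimed)
    )


s = Peer()
s.receive("a")
att_b = s.receive("b")

truthful = sender_accepts("b", att_b, frozenset({"a", "b"}))
under_report = sender_accepts("b", att_b, frozenset({"b"}))
```

Here the truthful bitfield is accepted, while the bitfield that hides piece a fails both the size check and the hash check, so the under-report is detected.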

5.4.4 Properties of a TrInc-augmented BitTorrent 

Our TrInc-based solution to equivocation in BitTorrent 
solves two difficult incentives-related problems. First, 
peers have incentive to truthfully reveal the pieces they 
have whenever they are asked to. TrInc removes the abil- 
ity to equivocate, and step-omission failures (remaining 
silent) result in getting no further pieces from a neighbor. 
Peers can therefore obtain long-lived trades with others 
only by truthfully reporting their pieces. 

Second, our solution adds additional security to Bit- 
Torrent’s bootstrapping mechanism. In BitTorrent, peers 
optimistically unchoke new participants, sending them 
pieces without requiring anything in return, to introduce 
them into the system. BitThief [24] exploits this by pre- 
tending not to be able to make progress [35]. However, 
such artifice is not possible with TrInc since with it a peer 
cannot hide the rate at which he is downloading pieces. 

Note, however, that what we propose is not a com- 
plete solution to problems with bootstrapping. Even with 
TrInc-enabled BitTorrent, a peer can steal a single piece 
from each other peer. Our goal of applying TrInc here is 
to ensure truthfulness in long-lived peerings, which (sur- 
prisingly) does not arise automatically. 


5.5 Other applications 


We see many other potential applications for TrInc. 
We briefly described three such apps in Section 2.1: 
simultaneous-turn games, electronic currency, and elec- 
tions. Here, we detail several others: 

Secure DNS is intended to protect the integrity of the 
Internet domain name system. One identified threat [6] 
is that a resolving name server could be compromised 
and forge incorrect responses. The official solution to 
this threat is data origin identification in the DNS Secu- 
rity Extensions (DNSSEC), which uses public-key sig- 
natures to authenticate name updates. However, this so- 
lution does not address a threat in which the compro- 
mised name server replies to a query with out-of-date 
data, which would still bear a valid signature. Modify- 
ing DNSSEC with TrInc could address this problem by 
preventing the resolving name server from equivocating 
about whether it has received an update. Once it ac- 
knowledges receipt to the authoritative name server, it 
can no longer pretend it has not received the update. 

Secure Origin BGP (soBGP) [44] is intended to 
protect the integrity of Internet routing updates. Like
DNSSEC, soBGP uses public-key signatures to authenticate
updates. Also like DNSSEC, soBGP is vulnerable
to a threat in which a compromised router advertises out- 
of-date routes, which would still bear valid signatures. 
TrInc could address this problem by preventing a router 
from equivocating about whether it has received a rout- 
ing update. 

Distributed hash tables (DHTs), such as Chord [37], 
Bamboo [33], and Kademlia [27], are vulnerable to mis- 
behaving nodes. In particular, a node can lie about which 
region of the keyspace it is responsible for. As nodes 
join and leave the DHT, these regions of responsibil- 
ity change (sometimes quite rapidly [33]) in response 
to reconfiguration messages. A node can equivocate 
about whether it has received a particular message, which 
may allow it to claim responsibility for a region of the 
keyspace it does not own. TrInc could be used to prevent 
this equivocation. 


Version control systems, such as CVS [41] and Subversion [29],
are often run on remote servers. Thus, they
are vulnerable to a threat model in which the server 
presents different views of the repository to different 
clients. Although this threat could be addressed at the 
block-store level [22], it might be more efficient to ad- 
dress it at the application level, in which case TrInc could 
prevent this equivocation. 


Distributed auctions [42] are vulnerable to cheating 
participants. A bidder can try to manipulate others’ bids 
by equivocating about the value of his current bid. An 
auctioneer can try to manipulate the bidding by equiv- 
ocating about her reserve price for a particular auction. 
TrInc could protect against both of these classes of cheat- 
ing, by preventing both bidders and auctioneers from 
equivocating. 

Leader election protocols [25] rely on a quorum of 
participants to agree on a choice of leader. For a quo- 
rum of size q, it can legitimately happen that two groups 
of size q − 1 will nominate different leaders. In this
case, one participant can equivocate about which leader 
to nominate, causing the protocol to select two leaders 
concurrently. TrInc could be used to prevent this equivo- 
cation. 


Digital signatures are used in many cryptographic 
protocols, but commonly use slow asymmetric key oper- 
ations [17]. However, TrInc allows faster symmetric key 
operations to be used instead. To do so, a signer merely 
has to have his trinket attest to the hash of the message to 
be signed using a shared symmetric key. Since this attes- 
tation can only be generated by a party with access to the 
symmetric key, and since the hardware includes the ID in 
any attestation, no other party (except the trusted session 
administrator) can have generated the attestation. Thus, 
it functions effectively as a digital signature, verifiable 
by anyone whose trinket has the same symmetric key installed.


Function                              Time (msec)
Attest (asymmetric, advance > 0)      230.24 ± 0.28
Attest (asymmetric, advance = 0)      198.21 ± 0.10
Attest (symmetric, advance > 0)       128.95 ± 0.08
Attest (symmetric, advance = 0)       105.90 ± 0.08
Verify Symmetric Attestation           85.81 ± 0.11

Table 3: TrInc microbenchmarks on a Gemalto .NET
Smartcard, with 95% confidence intervals.


6 TrInc Implementation 


The application case studies demonstrate the strong 
theoretical properties of TrIncs. In this section, we study 
the performance of TrIncs today. To this end, we have 
implemented TrInc on Gemalto .NET SmartCards [11], 
and present microbenchmarks that measure TrInc’s per- 
formance on these widely available pieces of trusted 
hardware. 


6.1 Microbenchmarks 

Our experimental setup consists of an Intel Core 2 
Duo 1.6GHz machine with 3GB of RAM, and a smart- 
card connected via a USB card reader. We present our 
microbenchmarks in Table 3, with results averaged over 
1,000 runs. In addition to TrInc’s API, we include a noop 
to essentially measure the round-trip time between PC 
and smartcard. 

Compare the Attest results on the card to those 
on the untrusted PC, where 3-DES took 0.017 ± 0.008
msec, and RSA took 8.6 ± 0.67 msec. It is no surprise
that a smartcard does not perform as well, but the dif- 
ference in relative performance between symmetric and 
asymmetric encryption is striking. On the PC, they dif- 
fer by a factor of over 500, while on the card they differ 
by less than a factor of 2. While using symmetric instead 
of asymmetric operations improves TrInc’s performance, 
we were surprised to see it was by this small a factor. 
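
The two factors quoted above can be checked with a few lines of arithmetic. The PC timings come from the text; the card timings are the Attest figures from Table 3 for advance > 0, and reading 128.95 msec as the symmetric figure is an assumption about the table layout.

```python
pc_symmetric_ms = 0.017       # 3-DES on the untrusted PC
pc_asymmetric_ms = 8.6        # RSA on the untrusted PC
card_symmetric_ms = 128.95    # Attest on the card, symmetric (assumed row)
card_asymmetric_ms = 230.24   # Attest on the card, asymmetric

pc_ratio = pc_asymmetric_ms / pc_symmetric_ms
card_ratio = card_asymmetric_ms / card_symmetric_ms

print(f"PC: {pc_ratio:.0f}x, card: {card_ratio:.2f}x")
# prints "PC: 506x, card: 1.79x"
```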


6.2 Why so slow? 


The conclusion is clear: today’s trusted hardware is 
slow! Indeed, it is much slower than would be allowed 
by most components of a distributed system. But why is 
it slow, and why do current applications that use trusted 
hardware not suffer as a result? 

We believe this is attributable to the fact that TrInc uses
trusted hardware in a fundamentally different way than 
that for which the hardware is currently designed. To- 
day’s trusted hardware is designed to bootstrap software, 
generally performing few operations during a machine’s 
boot cycle. Conversely, TrInc makes use of trusted hard- 
ware during operation, in some cases multiple times for 
each message sent. 

We proposed several versions in §3.5.6 that we believe
would be viable directions for future designs of trusted
hardware to take. In the interim, a logical solution is
to design protocols that limit the number of necessary
attestations, but such approaches are beyond the scope
of this paper. Nevertheless, our empirical results in the
following section indicate that making trusted hardware
more suitable for use in distributed systems today is a
valuable area of future work.


                                      Time (msec)
Operation                 TrInc              A2M
Noop                      6.99 ± 0.01
Append                    187.60 ± 0.15      551.93 ± 154
Lookup (Successful)       0.0122 ± 0.02      304.14 ± 6.87
Lookup (TooEarly)         162.24 ± 0.08      289.68 ± 2.23
(label and column not recovered)             294.16 ± 2.04

Table 4: TrInc-A2M microbenchmarks, with 95% confidence
intervals.


7 Application Evaluation 


We now turn to macrobenchmarks, evaluating TrInc 
as it applies to our three case studies: A2M, PeerReview, 
and BitTorrent. 


7.1 TrInc-A2M


In Section 5.2, we proposed a way to build A2M 
using TrInc. Besides demonstrating TrInc’s ease of use
and versatility, this construction allows us to compare the two
trusted-component designs. To this end, we have im- 
plemented A2M in the Gemalto .NET SmartCard, and 
a TrInc library—run on an untrusted machine—that ac- 
cesses TrInc as prescribed in Algorithm 2. 

We present microbenchmark comparisons in Table 4. 
As expected, TrInc performs Appends much more 
quickly, as it does not require as many writes to trusted 
storage. Where TrInc offers vast speed improvements 
over A2M is in successful Lookups; since these do not 
have to be either stored in trusted hardware or attested, 
they are merely local operations. Interestingly, A2M im- 
proves with Truncate, since A2M simply increases the 
log’s low counter and postpones the attestation of the op- 
eration until a lookup that needs to return FORGOTTEN. 
TrInc amortizes this cost, in the expectation that there 
will be more FORGOTTEN lookups than truncations. 

These results demonstrate that TrInc performs better 
on today’s trusted hardware. As trusted components im- 
prove, particularly in terms of memory writes and cryp- 
tographic operations, it is likely that A2M and TrInc will 
perform comparably well. However, the slowness of to- 
day’s trusted hardware brings to light the difference in 
complexity between A2M and TrInc. We believe TrInc’s 
relative simplicity makes it a more suitable candidate 
even with future designs of trusted hardware. 
Figure 3: Reduction in PeerReview’s message overhead
due to TrInc.


7.2 TrInc-PeerReview


In Section 5.3, we demonstrated how including TrInc 
into the design of an accountability system such as Peer- 
Review can decrease the amount of communication re- 
quired between participants. This represents one of the 
fundamental strengths of including a small, trusted com- 
ponent into an otherwise untrusted system. 

Applying TrInc to PeerReview removes the requirement
for a peer p to communicate with the witness set
of any other peer q, unless, of course, p happens to be
in q’s witness set. Using data from the original PeerReview
study [13], we demonstrate in Figure 3 the extent to
which TrInc reduces PeerReview’s communication overhead.
TrInc effectively removes the O(W^2) witness-set-to-
witness-set communication, for reasons described in
Section 5.3. As a result, the amount of additional
communication overhead scales linearly rather than
quadratically with the size W of the witness sets.


7.3 TrInc-BitTorrent 


To evaluate our TrInc-based solution for BitTorrent, 
we simulated using a “gold-standard” trinket in the 
Azureus BitTorrent client. To do so, we modified Bit- 
Torrent’s Have messages to include attestations to coun- 
ters. We observed that Have messages, originally in- 
tended simply to inform others when a peer receives a 
piece, come frequently enough in practice to also satisfy 
peers’ continual need for fresh attestations. 

We modified the BitTorrent code to recognize these 
new messages, and to cut off peers thereby discovered to 
be under-reporting. However, we never have the seeder 
punish a peer in this way. It seems reasonable to have 
such a forgiving seeder since otherwise peers who suf- 
fer failures—for example, from a corrupted disk—could 
never request blocks after they have attested to them. 

We ran our experiments on a local cluster consisting
of 23 leechers, each with upload bandwidth capped at
50Kbps, and one seeder, with upload bandwidth capped
at 80Kbps. We chose one host to act as a strategic piece
revealer using an algorithm from a prior study [21]. We
chose this host arbitrarily since, on the local cluster, we
found the hosts to be virtually indistinguishable in terms
of performance.

Figure 4: Rate of progress for various BitTorrent clients
when TrInc is used.

Our experiments demonstrated a clear loss in perfor- 
mance from under-reporting. In a representative run, the 
under-reporting peer took 27% longer to download the 
file than the other peers did on average, and 33% longer 
than the median. 

The under-reporter’s download times would have been 
much worse if not for the forgiving seeder. We show in 
Figure 4 the total number of blocks the under-reporter re- 
ceived over time, compared to the number of blocks he 
received from the seeder. We plot a representative, truth- 
ful peer from the swarm as a point of comparison. Be- 
cause other peers refused to send to the under-reporter 
until he revealed all the pieces in his possession, the 
seeder became the under-reporter’s only remaining op- 
tion. Indeed, the under-reporting peer obtained more 
pieces (73%) from the seeder than any other peer in the 
swarm (11% on average, 6% median). 

These results indicate the power of applying a small 
amount of trust, and small attestations piggybacked on 
existing protocol messages, to a large-scale decentralized 
system. 


8 Conclusions


In this paper, we presented TrInc, a simple yet power- 
ful abstraction for improving security in distributed sys- 
tems. TrInc is a trusted hardware module that holds a 
non-decreasing counter and a hidden cryptographic key. 
This combination, along with the computational machin- 
ery to support it, yields an abstraction that significantly 
improves various aspects of security in distributed sys- 
tems. 
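
To make the abstraction concrete, the following is a minimal Python sketch of a module that holds a non-decreasing counter and a hidden key. It is an illustration under stated assumptions, not the TrInc interface: the class name `Trinket`, the `attest` and `verify` signatures, the payload layout, and the use of a shared-key HMAC in place of whatever signing the real hardware performs are all hypothetical.

```python
import hashlib
import hmac


def _payload(prev: int, new: int, message_hash: bytes) -> bytes:
    # Hypothetical encoding of (previous counter, new counter, message hash).
    return f"{prev}|{new}|".encode() + message_hash


class Trinket:
    """Toy stand-in for a trusted counter-plus-key module (assumed API)."""

    def __init__(self, key: bytes):
        self._key = key      # hidden key; never exposed by the module
        self._counter = 0    # non-decreasing counter

    def attest(self, new_counter: int, message_hash: bytes):
        # Refuse any request that would move the counter backwards.
        if new_counter < self._counter:
            raise ValueError("counter may not decrease")
        prev, self._counter = self._counter, new_counter
        tag = hmac.new(self._key, _payload(prev, new_counter, message_hash),
                       hashlib.sha256).digest()
        return prev, new_counter, tag


def verify(key: bytes, prev: int, new: int, message_hash: bytes, tag: bytes) -> bool:
    # Assumption: the verifier shares the key (HMAC stand-in for a signature)
    # and accepts only attestations that actually advanced the counter.
    expected = hmac.new(key, _payload(prev, new, message_hash),
                        hashlib.sha256).digest()
    return prev < new and hmac.compare_digest(expected, tag)


trinket = Trinket(b"demo-key")
digest = hashlib.sha256(b"message 1").digest()
prev, new, tag = trinket.attest(1, digest)
print(verify(b"demo-key", prev, new, digest, tag))
```

Because the counter never decreases, a second request for a counter value that has already been passed is rejected in this sketch, which is the property that prevents two conflicting statements from being bound to the same counter value.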

TrInc was inspired by the seminal work of A2M,
which introduced the idea of a trusted log for improving
system security. Relative to A2M, TrInc has a
significantly simpler abstraction: a counter instead of a log.
We have also demonstrated a wider range of applications
for, and benefits from, a trusted module than previously
shown.

NSDI ’09: 6th USENIX Symposium on Networked Systems Design and Implementation

We have implemented TrInc on real, currently avail- 
able trusted hardware. We have performed three detailed 
case studies of TrInc as applied to different distributed 
protocols. Our results show that this abstraction is easy 
to deploy, powerful, and versatile. 
Abstract


Obtaining user opinion (using votes) is essential to rank- 
ing user-generated online content. However, any content 
voting system is susceptible to the Sybil attack where ad- 
versaries can out-vote real users by creating many Sybil 
identities. In this paper, we present SumUp, a Sybil- 
resilient vote aggregation system that leverages the trust 
network among users to defend against Sybil attacks. 
SumUp uses the technique of adaptive vote flow aggre- 
gation to limit the number of bogus votes cast by adver- 
saries to no more than the number of attack edges in the 
trust network (with high probability). Using user feed- 
back on votes, SumUp further restricts the voting power 
of adversaries who continuously misbehave to below the 
number of their attack edges. Using detailed evaluation 
of several existing social networks (YouTube, Flickr), we 
show SumUp’s ability to handle Sybil attacks. By apply- 
ing SumUp on the voting trace of Digg, a popular news 
voting site, we have found strong evidence of attack on 
many articles marked “popular” by Digg. 


1 Introduction 


The Web 2.0 revolution has fueled a massive prolifera- 
tion of user-generated content. While allowing users to 
publish information has led to democratization of Web 
content and promoted diversity, it has also made the Web 
increasingly vulnerable to content pollution from spam- 
mers, advertisers and adversarial users misusing the sys- 
tem. Therefore, the ability to rank content accurately is 
key to the survival and the popularity of many user- 
content hosting sites. Similarly, content rating is also in- 
dispensable in peer-to-peer file sharing systems to help 
users avoid mislabeled or low quality content [7, 16,25]. 

People have long realized the importance of incorpo- 
rating user opinion in rating online content. Traditional 
ranking algorithms such as PageRank [2] and HITS [12] 
rely on implicit user opinions reflected in the link struc- 
tures of hypertext documents. For arbitrary content types, 
user opinion can be obtained in the form of explicit 
votes. Many popular websites today rely on user votes to 
rank news (Digg, Reddit), videos (YouTube), documents 
(Scribd) and consumer reviews (Yelp, Amazon). 

Content rating based on users’ votes is prone to vote 
manipulation by malicious users. Defending against vote 
manipulation is difficult due to the Sybil attack where 
the attacker can out-vote real users by creating many 


Sybil identities. The popularity of content-hosting sites 
has made such attacks very profitable as malicious enti- 
ties can promote low-quality content to a wide audience. 
Successful Sybil attacks have been observed in the wild. 
For example, online polling on the best computer science 
school motivated students to deploy automatic scripts to 
vote for their schools repeatedly [9]. There are even com- 
mercial services that help paying clients promote their 
content to the top spot on popular sites such as YouTube 
by voting from a large number of Sybil accounts [22]. 

In this paper, we present SumUp, a Sybil-resilient on- 
line content voting system that prevents adversaries from 
arbitrarily distorting voting results. SumUp leverages the 
trust relationships that already exist among users (e.g. in 
the form of social relationships). Since it takes human ef- 
forts to establish a trust link, the attacker is unlikely to 
possess many attack edges (links from honest users to an 
adversarial identity). Nevertheless, he may create many 
links among Sybil identities themselves. 

SumUp addresses the vote aggregation problem which 
can be stated as follows: Given m votes on a given object, 
of which an arbitrary fraction may be from Sybil iden- 
tities created by an attacker, how do we collect votes in 
a Sybil resilient manner? A Sybil-resilient vote aggrega- 
tion solution should satisfy three properties. First, the so- 
lution should collect a significant fraction of votes from 
honest users. Second, if the attacker has e_A attack edges,
the maximum number of bogus votes should be bounded
by e_A, independent of the attacker’s ability to create many
Sybil identities behind him. Third, if the attacker repeat- 
edly casts bogus votes, his ability to vote in the future 
should be diminished. SumUp achieves all three proper- 
ties with high probability in the face of Sybil attacks. The 
key idea in SumUp is the adaptive vote flow technique 
that appropriately assigns and adjusts link capacities in 
the trust graph to collect the net vote for an object. 

Previous works have also exploited the use of trust net- 
works to limit Sybil attacks [3, 15, 18,26,27,30], but none 
directly addresses the vote aggregation problem. Sybil- 
Limit [26] performs admission control so that at most 
O(log n) Sybil identities are accepted per attack edge 
among n honest identities. As SybilLimit results in 10~30 
bogus votes per attack edge in a million-user system [26], 
SumUp provides a notable improvement by limiting bogus
votes to one per attack edge. Additionally, SumUp lever- 
ages user feedback to further diminish the voting power 
of adversaries that repeatedly vote maliciously. 
In SumUp, each vote collector assigns capacities to
links in the trust graph and computes a set of approximate
max-flow paths from itself to all voters. Because
only votes on paths with non-zero flows are counted, the
number of bogus votes collected is limited by the total
capacity of attack edges instead of links among Sybil
identities. Typically, the number of voters on a given object
is much smaller than the total user population (n). Based
on this insight, SumUp assigns C_max units of capacity in
total, thereby limiting the number of votes that can be
collected to C_max. SumUp adjusts C_max automatically
according to the number of honest voters for each object
so that it can aggregate a large fraction of votes from
honest users. As C_max is far less than n, the number of
bogus votes collected on a single object (i.e. the attack
capacity) is no more than the number of attack edges (e_A).
SumUp’s security guarantee on bogus votes is probabilistic.
If a vote collector happens to be close to an attack
edge (a low probability event), the attack capacity could
be much higher than e_A. By re-assigning link capacities
using feedback, SumUp can restrict the attack capacity to
be below e_A even if the vote collector happens to be close
to some attack edges.

Using a detailed evaluation of several existing social 
networks (YouTube, Flickr), we show that SumUp suc- 
cessfully limits the number of bogus votes to the num- 
ber of attack edges and is also able to collect > 90% of 
votes from honest voters. By applying SumUp to the vot- 
ing trace and social network of Digg (an online news vot- 
ing site), we have found hundreds of suspicious articles 
that have been marked “popular” by Digg. Based on man- 
ual sampling, we believe that at least 50% of suspicious 
articles exhibit strong evidence of Sybil attacks. 

This paper is organized as follows. In Section 2, we dis- 
cuss related work and in Section 3 we define the system 
model and the vote aggregation problem. Section 4 out- 
lines the overall approach of SumUp and Sections 5 and 
6 present the detailed design. In Section 7, we describe our 
evaluation results. Finally, in Section 8, we discuss how to
extend SumUp to a decentralized setup, and we conclude in
Section 9.


2 Related Work 


Ranking content is arguably one of the Web’s most im- 
portant problems. As users are the ultimate consumers of 
content, incorporating their opinions in the form of either 
explicit or implicit votes becomes an essential ingredient 
in many ranking systems. This section summarizes related 
work in vote-based ranking systems. Specifically, we ex- 
amine how existing systems cope with Sybil attacks [6] 
and compare their approaches to SumUp. 


2.1 Hyperlink-based ranking 


PageRank [2] and HITS [12] are two popular ranking
algorithms that exploit the implicit human judgment embedded
in the hyperlink structure of web pages. A hyperlink
from page A to page B can be viewed as an implicit
endorsement (or vote) of page B by the creator of page A. In
both algorithms, a page has a higher ranking if it is linked
to by more pages with high rankings. Both PageRank and
HITS are vulnerable to Sybil attacks. The attacker can
significantly amplify the ranking of a page A by creating
many web pages that link to each other and also to A. To
mitigate this attack, the ranking system must probabilistically
reset its PageRank computation from a small set of
trusted web pages with probability ε [20]. Despite
probabilistic resets, Sybil attacks can still amplify the PageRank
of an attacker’s page by a factor of 1/ε [29], resulting in a
big win for the attacker because ε is small.


2.2 User Reputation Systems 


A user reputation system computes a reputation value for 
each identity in order to distinguish well-behaved identi- 
ties from misbehaving ones. It is possible to use a user 
reputation system for vote aggregation: the voting system 
can either count votes only from users whose reputations 
are above a threshold or weigh each vote using the voter’s 
reputation. Like SumUp, existing reputation systems miti- 
gate attacks by exploiting two resources: the trust network 
among users and explicit user feedback on others’ behav- 
iors. We discuss the strengths and limitations of existing 
reputation systems in the context of vote aggregation and 
how SumUp builds upon ideas from prior work. 


Feedback based reputations In EigenTrust [11] and 
Credence [25], each user independently computes person- 
alized reputation values for all users based on past trans- 
actions or voting histories. In EigenTrust, a user increases 
(or decreases) another user’s rating upon a good (or bad) 
transaction. In Credence [25], a user gives a high (or low) 
rating to another user if their voting records on the same 
set of file objects are similar (or dissimilar). Because not 
all pairs of users are known to each other based on direct 
interaction or votes on overlapping sets of objects, both 
Credence and EigenTrust use a PageRank-style algorithm 
to propagate the reputations of known users in order to 
calculate the reputations of unknown users. As such, both 
systems suffer from the same vulnerability as PageRank 
where an attacker can amplify the reputation of a Sybil 
identity by a factor of 1/ε.

Neither EigenTrust nor Credence provides provable
guarantees on the damage of Sybil attacks under arbitrary 
attack strategies. In contrast, SumUp bounds the voting 
power of an attacker on a single object to be no more than 
the number of attack edges he possesses irrespective of the 
attack strategies in use. SumUp uses only negative feed- 
back as opposed to EigenTrust and Credence that use both 
positive and negative feedback. Using only negative feed- 
back has the advantage that an attacker cannot boost his 
attack capacity easily by casting correct votes on objects 
that he does not care about. 
DSybil [28] is a feedback-based recommendation system
that provides provable guarantees on the damages of
arbitrary attack strategies. DSybil differs from SumUp in 
its goals. SumUp is a vote aggregation system which al- 
lows for arbitrary ranking algorithms to incorporate col- 
lected votes to rank objects. For example, the ranking al- 
gorithm can rank objects by the number of votes collected. 
In contrast, DSybil’s recommendation algorithm is fixed: 
it recommends a random object among all objects whose 
sum of the weighted vote count exceeds a certain thresh- 
old. 


Trust network-based reputations A number of pro- 
posals from the semantic web and peer-to-peer literature 
rely on the trust network between users to compute repu- 
tations [3, 8, 15,21,30]. Like SumUp, these proposals ex- 
ploit the fact that it is difficult for an attacker to obtain 
many trust edges from honest users because trust links 
reflect offline social relationships. Of the existing work, 
Advogato [15], Appleseed [30] and Sybilproof [3] are re- 
silient to Sybil attacks in the sense that an attacker cannot 
boost his reputation by creating a large number of Sybil 
identities “behind” him. Unfortunately, a Sybil-resilient 
user reputation scheme does not directly translate into a 
Sybil-resilient voting system: Advogato only computes a 
non-zero reputation for a small set of identities, disallow- 
ing a majority of users from being able to vote. Although 
an attacker cannot improve his reputation with Sybil iden- 
tities in Appleseed and Sybilproof, the reputation of Sybil 
identities is almost as good as that of the attacker’s non- 
Sybil accounts. Together, these reputable Sybil identities 
can cast many bogus votes. 


2.3 Sybil Defense using trust networks 


Many proposals use trust networks to defend against Sybil 
attacks in the context of different applications: Sybil- 
Guard [27] and SybilLimit [26] help a node admit an- 
other node in a decentralized system such that the ad- 
mitted node is likely to be an honest node instead of a 
Sybil identity. Ostra [18] limits the rate of unwanted com- 
munication that adversaries can inflict on honest nodes. 
Sybil-resilient DHTs [5, 14] ensure that DHT routing is 
correct in the face of Sybil attacks. Kaleidoscope [23] 
distributes proxy identities to honest clients while mini- 
mizing the chances of exposing them to the censor with 
many Sybil identities. SumUp builds on their insights and 
addresses a different problem, namely, aggregating votes 
for online content rating. Like SybilLimit, SumUp bounds 
the power of attackers according to the number of attack 
edges. In SybilLimit, each attack edge results in O(log n) 
Sybil identities accepted by honest nodes. In SumUp, each 
attack edge leads to at most one vote with high probability. 
Additionally, SumUp uses user feedback on bogus votes 
to further reduce the attack capacity to below the number 
of attack edges. The feedback mechanism of SumUp is 
inspired by Ostra [18]. 


3 The Vote Aggregation Problem 


In this section, we outline the system model and formalize 
the vote aggregation problem that SumUp addresses. 


System model: We describe SumUp in a centralized 
setup where a trusted central authority maintains all the 
information in the system and performs vote aggregation 
using SumUp in order to rate content. This centralized 
mode of operation is suitable for web sites such as Digg, 
YouTube and Facebook, where all users’ votes and their 
trust relationships are collected and maintained by a sin- 
gle trusted entity. We describe how SumUp can be applied 
in a distributed setting in Section 8. 


SumUp leverages the trust network among users to defend
against Sybil attacks [3, 15, 26, 27, 30]. Each trust link
is directional. However, the creation of each link requires
the consent of both users. Typically, user i creates a trust
link to j if i has an offline social relationship with j.
Similar to previous work [18, 26], SumUp requires that links
are difficult to establish. As a result, an attacker only
possesses a small number of attack edges (e_A) from honest
users to colluding adversarial identities. Even though e_A
is small, the attacker can create many Sybil identities and
link them to adversarial entities. We refer to votes from
colluding adversaries and their Sybil identities as bogus
votes.


SumUp aggregates votes from one or more trusted vote 
collectors. A trusted collector is required in order to break 
the symmetry between honest nodes and Sybil nodes [3]. 
SumUp can operate in two modes depending on the choice
of trusted vote collectors. In personalized vote aggrega- 
tion, SumUp uses each user as his own vote collector to 
collect the votes of others. As each user collects a differ- 
ent number of votes on the same object, she also has a 
different (personalized) ranking of content. In global vote 
aggregation, SumUp uses one or more pre-selected vote 
collectors to collect votes on behalf of all users. Global 
vote aggregation has the advantage of allowing for a sin- 
gle global ranking of all objects; however, its performance 
relies on the proper selection of trusted collectors. 


Vote Aggregation Problem: Any identity in the trust 
network including Sybils can cast a vote on any object to 
express his opinion on that object. In the simplest case, 
each vote is either positive or negative (+1 or -1). Alterna- 
tively, to make a vote more expressive, its value can vary 
within a range with higher values indicating more favor- 
able opinions. A vote aggregation system collects votes 
on a given object. Based on collected votes and various 
other features, a separate ranking system determines the 
final ranking of an object. The design of the final rank- 
ing system is outside the scope of this paper. However, we 
note that many ranking algorithms utilize both the number 
of votes and the average value of votes to determine an 
object’s rank [2, 12]. Therefore, to enable arbitrary ranking
algorithms, a vote aggregation system should collect
a significant fraction of votes from honest voters.

Figure 1: SumUp computes a set of approximate max-flow
paths from the vote collector s to all voters (A,B,C,D). Straight
lines denote trust links and curly dotted lines represent the vote
flow paths along multiple links. Vote flow paths to honest voters
are “congested” at links close to the collector while paths to
Sybil voters are also congested at far-away attack edges.

A voting system can also let the vote collector pro- 
vide negative feedback on malicious votes. In personal- 
ized vote aggregation, each collector gives feedback ac- 
cording to his personal taste. In global vote aggregation, 
the vote collector(s) should only provide objective feed- 
back, e.g. negative feedback for positive votes on cor- 
rupted files. Such feedback is available for a very small 
subset of objects. 

We describe the desired properties of a vote aggregation
system. Let G = (V, E) be a trust network with vote
collector s ∈ V. V comprises an unknown set of honest
users V_h ⊂ V (including s), and the attacker controls all
vertices in V \ V_h, many of which represent Sybil
identities. Let e_A represent the number of attack edges from
honest users in V_h to V \ V_h. Given that nodes in G cast
votes on a specific object, a vote aggregation mechanism
should achieve three properties:

1. Collect a large fraction of votes from honest users.

2. Limit the number of bogus votes from the attacker
   by e_A, independent of the number of Sybil identities
   in V \ V_h.

3. Eventually ignore, using feedback, votes from nodes
   that repeatedly cast bogus votes.


4 Basic Approach 


This section describes the intuition behind adaptive vote 
flow that SumUp uses to address the vote aggregation 
problem. The key idea of this approach is to appropriately
assign link capacities to bound the attack capacity. 

In order to limit the number of votes that Sybil identi- 
ties can propagate for an object, SumUp computes a set of 
max-flow paths in the trust graph from the vote collector 
to all voters on a given object. Each vote flow consumes 
one unit of capacity along each link traversed. Figure 1 
gives an example of the resulting flows from the collec- 
tor s to voters A,B,C,D. When all links are assigned unit 
capacity, the attack capacity using the max-flow based
approach is bounded by e_A.

The concept of max-flow has been applied in several 
reputation systems based on trust networks [3, 15]. When 
applied in the context of vote aggregation, the challenge is 
that links close to the vote collector tend to become “con- 
gested” (as shown in Figure 1), thereby limiting the total 
number of votes collected to be no more than the collec- 
tor’s node degree. Since practical trust networks are sparse 
with small median node degrees, only a few honest votes 
can be collected. We cannot simply enhance the capac- 
ity of each link to increase the number of votes collected 
since doing so also increases the attack capacity. Hence, a 
flow-based vote aggregation system faces the tradeoff be- 
tween the maximum number of honest votes it can collect 
and the number of potentially bogus votes collected. 
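
The tradeoff can be seen on a toy graph. The sketch below is only an illustration, assuming a hypothetical trust graph, a hypothetical `SINK` node that every voter feeds with capacity 1, and a textbook Edmonds-Karp max-flow rather than the approximate computation SumUp itself uses. With unit capacities, the Sybil voters behind a single attack edge contribute at most one vote, but the honest votes collected are also capped by the collector's degree.

```python
from collections import defaultdict, deque


def max_flow(edges, source, sink):
    """Edmonds-Karp max-flow; edges maps (u, v) -> integer capacity."""
    residual = defaultdict(int)
    adj = defaultdict(set)
    for (u, v), cap in edges.items():
        residual[(u, v)] += cap
        adj[u].add(v)
        adj[v].add(u)  # reverse direction is needed in the residual graph
    flow = 0
    while True:
        # Breadth-first search for a shortest augmenting path.
        parent = {source: None}
        queue = deque([source])
        while queue and sink not in parent:
            u = queue.popleft()
            for v in adj[u]:
                if v not in parent and residual[(u, v)] > 0:
                    parent[v] = u
                    queue.append(v)
        if sink not in parent:
            return flow
        # Find the bottleneck along the path, then push that much flow.
        bottleneck, v = float("inf"), sink
        while parent[v] is not None:
            bottleneck = min(bottleneck, residual[(parent[v], v)])
            v = parent[v]
        v = sink
        while parent[v] is not None:
            u = parent[v]
            residual[(u, v)] -= bottleneck
            residual[(v, u)] += bottleneck
            v = u
        flow += bottleneck


def collected_votes(trust_links, collector, voters):
    edges = {link: 1 for link in trust_links}        # unit capacity per trust link
    edges.update({(v, "SINK"): 1 for v in voters})   # each voter casts one vote
    return max_flow(edges, collector, "SINK")


# Hypothetical graph: collector s has degree 2; h2 -> m is the only attack edge.
links = [("s", "h1"), ("s", "h2"), ("h1", "a"), ("h1", "b"), ("h1", "c"),
         ("h2", "d"), ("h2", "m")] + [("m", f"sybil{i}") for i in range(5)]
honest = ["a", "b", "c", "d"]
sybils = [f"sybil{i}" for i in range(5)]

print(collected_votes(links, "s", sybils))           # bogus votes via one attack edge
print(collected_votes(links, "s", honest))           # honest votes, capped by degree of s
print(collected_votes(links, "s", honest + sybils))  # total is still capped by degree of s
```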

The adaptive vote flow technique addresses this trade- 
off by exploiting two basic observations. First, the number 
of honest users voting for an object, even a popular one, 
is significantly smaller than the total number of users. For 
example, 99% of popular articles on Digg have fewer than 
4000 votes which represents 1% of active users. Second, 
vote flow paths to honest voters tend to be only “con- 
gested” at links close to the vote collector while paths 
to Sybil voters are also congested at a few attack edges. 
When e_A is small, attack edges tend to be far away from
the vote collector. As shown in Figure 1, vote flow paths
to honest voters A and B are congested at the link l_1 while
paths to Sybil identities C and D are congested at both l_2
and attack edge l_3.

The adaptive vote flow computation uses three key
ideas. First, the algorithm restricts the maximum number
of votes collected on an object to a value C_max. As
C_max is used to assign the overall capacity in the trust
graph, a small C_max results in less capacity for the
attacker. SumUp can adaptively adjust C_max to collect a
large fraction of honest votes on any given object. When
the number of honest voters is O(n^α) where α < 1, the
expected number of bogus votes is limited to 1 + o(1) per
attack edge (Section 5.4).

The second important aspect of SumUp relates to capacity
assignment, i.e. how to assign capacities to each
trust link in order to collect a large fraction of honest votes
and only a few bogus ones. In SumUp, the vote collector
distributes C_max tickets downstream in a breadth-first
search manner within the trust network. The capacity
assigned to a link is the number of tickets distributed along
the link plus one. As Figure 2 illustrates, the ticket
distribution process introduces a vote envelope around the vote
collector s; beyond the envelope all links have capacity
1. The vote envelope contains C_max nodes that can be
viewed as entry points. There is enough capacity within
the envelope to collect C_max votes from entry points. On
the other hand, an attack edge beyond the envelope can
propagate at most 1 vote regardless of the number of Sybil
identities behind that edge. SumUp re-distributes tickets
based on feedback to deal with attack edges within the
envelope.

Figure 2: Through ticket distribution, SumUp creates a vote
envelope around the collector. The capacities of links beyond the
envelope are assigned to be one, limiting the attack capacity to
be at most one per attack edge for adversaries outside this
envelope. There is enough capacity within the envelope, such that
nodes inside act like entry points for outside voters.

The final key idea in SumUp is to leverage user feed- 
back to penalize attack edges that continuously propa- 
gate bogus votes. One cannot penalize individual identi- 
ties since the attacker may always propagate bogus votes 
using new Sybil identities. Since an attack edge is always 
present in the path from the vote collector to a malicious 
voter [18], SumUp re-adjusts capacity assignment across 
links to reduce the capacity of penalized attack edges. 


5 SumUp Design 


In this section, we present the basic capacity assignment 
algorithm that achieves two of the three desired properties 
discussed in Section 3: (a) Collect a large fraction of votes 
from honest users; (b) Restrict the number of bogus votes 
to one per attack edge with high probability. Later, in
Section 6, we show how to adjust capacity based on feedback
to deal with repeatedly misbehaving adversarial nodes.

We describe how link capacities are assigned given a
particular C_max in Section 5.1 and present a fast algorithm
to calculate approximate max-flow paths in Section 5.2.
In Section 5.3, we introduce an additional optimization
strategy that prunes links in the trust network
so as to reduce the number of attack edges. We formally
analyze the security properties of SumUp in Section 5.4
and show how to adaptively set C_max in Section 5.5.


5.1 Capacity assignment 


The goal of capacity assignment is twofold. On the one 
hand, the assignment should allow the vote collector to 
gather a large fraction of honest votes. On the other hand, 
the assignment should minimize the attack capacity C_A such
that C_A ≈ e_A.

As Figure 2 illustrates, the basic idea of capacity
assignment is to construct a vote envelope around the vote
collector with at least C_max entry points. The goal is
to minimize the chances of including an attack edge in
the envelope and to ensure that there is enough capacity
within the envelope so that all vote flows from C_max
entry points can reach the collector.

Figure 3: Each link shows the number of tickets distributed to
that link from s (C_max = 6). A node consumes one ticket and
distributes the remaining evenly via its outgoing links to the next
level. Tickets are not distributed to links pointing to the same
level (B→A), or to a lower level (E→B). The capacity of each
link is equal to one plus the number of tickets.


We achieve this goal using a ticket distribution mechanism
which results in decreasing capacities for links with
increasing distance from the vote collector. The distribution
mechanism is best described using a propagation
model where the vote collector is to spread C_max tickets
across all links in the trust graph. Each ticket corresponds
to a capacity value of 1. We associate each node with a
level according to its shortest path distance from the vote
collector, s. Node s is at level 0. Tickets are distributed to
nodes one level at a time. If a node at level l has received
t_in tickets from nodes at level l − 1, the node consumes
one ticket and re-distributes the remaining tickets evenly
across all its outgoing links to nodes at level l + 1, i.e.
t_out = t_in − 1. The capacity value of each link is set to
be one plus the number of tickets distributed on that link.
Tickets are not distributed to links connecting nodes at
the same level or from a higher to lower level. The set of
nodes with positive incoming tickets falls within the vote
envelope and thus represents the entry points.


Ticket distribution ensures that all C_max entry points have
positive vote flows to the vote collector. Therefore, if there
exists an edge-independent path connecting one of the entry
points to an outside voter, the corresponding vote can be
collected. We show in Section 5.4 that such a path exists with
good probability. When C_max is much smaller than the number of
honest nodes (n), the vote envelope is very small. Therefore,
all attack edges reside outside the envelope, resulting in
C_A ≈ e_A with high probability.


Figure 3 illustrates an example of the ticket distribution
process. The vote collector (s) is to distribute C_max = 6
tickets among all links. Each node collects tickets from its
lower level neighbors, keeps one to itself and re-distributes
the rest evenly across all outgoing links to the next level. In
Figure 3, s sends 3 tickets down each of its outgoing links.
Since A has more outgoing links (3) than its remaining tickets
(2), link A→D receives no tickets. Tickets are not distributed
to links between nodes at the same level (B→A) or to links from
a higher to a lower level (E→B). The final number of tickets
distributed on each link is shown in Figure 3. Except for
immediate outgoing edges from the vote collector, the capacity
value of each link is equal to the number of tickets it
receives plus one.
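
The ticket propagation described in this section can be sketched
in a few lines of Python. This is a minimal illustration under
stated assumptions, not the authors' implementation: the function
names, the adjacency-list input format, and the rule for splitting
a remainder (the first links in list order receive the extra
ticket) are all hypothetical. The text does not say what capacity
the collector's own links receive, so the sketch simply applies
the one-plus-tickets rule to every link. The example graph is
invented and only loosely resembles Figure 3.

```python
from collections import deque


def bfs_levels(out_links, collector):
    """Level of each reachable node = shortest-path distance from the collector."""
    level = {collector: 0}
    queue = deque([collector])
    while queue:
        u = queue.popleft()
        for v in out_links.get(u, []):
            if v not in level:
                level[v] = level[u] + 1
                queue.append(v)
    return level


def distribute_tickets(out_links, collector, c_max):
    """Return (tickets per link, capacity per link) for one vote collector."""
    level = bfs_levels(out_links, collector)
    tickets_in = {u: 0 for u in level}
    tickets_in[collector] = c_max
    link_tickets = {}
    # Process nodes level by level, so a node is handled only after all of
    # its lower-level neighbors have passed their tickets down.
    for u in sorted(level, key=level.get):
        next_level = [v for v in out_links.get(u, []) if level[v] == level[u] + 1]
        # The collector hands out all C_max tickets; every other node keeps one.
        t_out = tickets_in[u] if u == collector else max(tickets_in[u] - 1, 0)
        for i, v in enumerate(next_level):
            # Assumed tie-break: the first (t_out mod k) links get one extra ticket.
            extra = 1 if i < t_out % len(next_level) else 0
            share = t_out // len(next_level) + extra
            link_tickets[(u, v)] = share
            tickets_in[v] += share
    # Capacity is one plus the tickets on the link; same-level and
    # higher-to-lower links receive no tickets and so get capacity one.
    capacity = {(u, v): link_tickets.get((u, v), 0) + 1
                for u in level for v in out_links.get(u, [])}
    return link_tickets, capacity


# Hypothetical graph, only loosely modeled on Figure 3.
graph = {"s": ["A", "B"], "A": ["C", "E", "D"], "B": ["A", "F"], "E": ["B"]}
tickets, capacity = distribute_tickets(graph, "s", c_max=6)
print(tickets)
print(capacity)
```

In this example A keeps one of its three tickets and has two left
for three next-level links, so the last link in list order (A→D)
receives none, mirroring the situation described for Figure 3.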


5.2 Approximate Max-flow calculation 


Once capacity assignment is done, the task remains to cal- 
culate the set of max-flow paths from the vote collector to 
all voters on a given object. It is possible to use existing 
max-flow algorithms such as Ford-Fulkerson and Preflow 
push [4] to compute vote flows. Unfortunately, these existing
algorithms require O(E) running time to find each vote flow,
where E is the number of edges in the graph. Since vote
aggregation only aims to collect a large fraction of honest
votes, it is not necessary to compute exact max-flow paths. In
particular, we can exploit the structure of capacity assignment
to compute a set of approximate vote flows in O(Δ) time, where Δ
is the diameter of the graph. For expander-like networks,
Δ = O(log n). For practical social networks with a few million
users, Δ ≈ 20.

Our approximation algorithm works incrementally by finding one
vote flow for a voter at a time. Unlike the classic
Ford-Fulkerson algorithm, our approximation performs a greedy
search from the voter to the collector in O(Δ) time instead of a
breadth-first search from the collector, which takes O(E)
running time. Starting at a voter, the greedy search strategy
attempts to explore a node at a lower level if there exists an
incoming link with positive capacity. Since it is not always
possible to find such a candidate for exploration, the
approximation algorithm allows a threshold (t) of non-greedy
steps which explore nodes at the same or a higher level.
Therefore, the number of nodes visited by the greedy search is
bounded by (Δ + 2t). Greedy search works well in practice. For
links within the vote envelope, there is more capacity for
lower-level links and hence greedy search is more likely to find
a non-zero capacity path by exploring lower-level nodes. For
links outside the vote envelope, greedy search results in short
paths to one of the vote entry points.
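
The greedy search can be illustrated with the following Python
sketch. It is a simplified reading of the description above, not
the authors' code: the function name, the dictionary-based graph
representation, the absence of backtracking, and the tie-break
among lower-level candidates (largest residual capacity) are
assumptions made for the example.

```python
def greedy_vote_flow(in_links, capacity, level, collector, voter, t):
    """Walk from `voter` toward `collector`.

    Returns the path as a list of links (collector side first) and uses up
    one unit of capacity on each link, or returns None if no path is found
    within `t` non-greedy steps.
    """
    path, node, visited, budget = [], voter, {voter}, t
    while node != collector:
        # Incoming links with residual capacity, from nodes not yet on the walk.
        usable = [u for u in in_links.get(node, [])
                  if capacity.get((u, node), 0) > 0 and u not in visited]
        lower = [u for u in usable if level[u] < level[node]]
        if lower:
            # Greedy step. Assumed tie-break: largest residual capacity.
            nxt = max(lower, key=lambda u: capacity[(u, node)])
        elif usable and budget > 0:
            # Non-greedy step to a node at the same or a higher level.
            nxt = usable[0]
            budget -= 1
        else:
            return None
        path.append((nxt, node))
        visited.add(nxt)
        node = nxt
    for link in path:
        capacity[link] -= 1
    return path[::-1]


# Hypothetical four-node example: collector s, level-1 nodes A and B, voter C.
level = {"s": 0, "A": 1, "B": 1, "C": 2}
in_links = {"A": ["s", "B"], "B": ["s"], "C": ["A", "B"]}
capacity = {("s", "A"): 1, ("s", "B"): 2, ("A", "C"): 1,
            ("B", "C"): 1, ("B", "A"): 1}
print(greedy_vote_flow(in_links, capacity, level, "s", "C", t=1))
print(greedy_vote_flow(in_links, capacity, level, "s", "A", t=1))
```

The second call needs one non-greedy step: the link s→A is already
saturated by the first flow, so the walk moves sideways to B before
reaching the collector.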


5.3 Optimization via link pruning


We introduce an optimization strategy that performs link
pruning to reduce the number of attack edges, thereby reducing
the attack capacity. Pruning is performed prior to link capacity
assignment and its goal is to bound the in-degree of each node
to a small value, d_in_thres. As a result, the number of attack
edges is reduced if some adversarial nodes have more than
d_in_thres incoming edges from honest nodes. We speculate that
the more honest neighbors an adversarial node has, the easier it
is for it to trick an honest node into trusting it. Therefore,
the number of attack edges in the pruned network is likely to be
smaller than that in the original network. On the other hand,
pruning is unlikely to affect honest users since each honest
node only attempts to cast one vote via one of its incoming
links.

Since it is not possible to accurately discern honest 
identities from Sybil identities, we give all identities the 
chance to have their votes collected. In other words, prun- 
ing should never disconnect a node. The minimally con- 
nected network that satisfies this requirement is a tree 
rooted at the vote collector. A tree topology minimizes 
attack edges but is also overly restrictive for honest nodes 
because each node has exactly one path from the collec- 
tor: if that path is saturated, a vote cannot be collected. 
A better tradeoff is to allow each node to have at most
d_in_thres > 1 incoming links in the pruned network so that
honest nodes have a large set of diverse paths while limiting
each adversarial node to only d_in_thres attack edges. We
examine the specific parameter choice of d_in_thres in
Section 7.

Pruning each node to have at most d_in_thres incoming links is
done in several steps. First, we remove all links except those
connecting nodes at a lower level (l) to neighbors at the next
level (l + 1). Next, we remove a subset of incoming links at
each node so that the remaining links do not exceed d_in_thres.
In the third step, we add back links removed in step one for
nodes with fewer than d_in_thres incoming links. Finally, we add
one outgoing link back to nodes that have no outgoing links
after step three, with priority given to links going to the next
level. By preferentially preserving links from lower to higher
levels, pruning does not interfere with SumUp’s capacity
assignment and flow computation.
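
The four pruning steps can be sketched as follows. This is an
illustrative Python rendering, not the authors' implementation:
the function name and input format are hypothetical, and because
the text does not say which incoming links are dropped in step
two, the sketch assumes the first d_in_thres links in input order
are kept.

```python
def prune_links(links, level, d_in_thres):
    """Return the set of links kept after the four pruning steps."""
    links = list(links)
    # Step 1: keep only links from a level-l node to a level-(l+1) node.
    downward = [(u, v) for u, v in links if level[v] == level[u] + 1]
    set_aside = [(u, v) for u, v in links if level[v] != level[u] + 1]
    # Step 2: cap each node's in-degree at d_in_thres
    # (assumption: the first links in input order are the ones kept).
    kept, in_degree = set(), {}
    for u, v in downward:
        if in_degree.get(v, 0) < d_in_thres:
            kept.add((u, v))
            in_degree[v] = in_degree.get(v, 0) + 1
    # Step 3: add back step-one links for nodes still below the threshold.
    for u, v in set_aside:
        if in_degree.get(v, 0) < d_in_thres:
            kept.add((u, v))
            in_degree[v] = in_degree.get(v, 0) + 1
    # Step 4: give one outgoing link back to nodes left with none,
    # preferring links that go to the next level.
    has_out = {u for u, _ in kept}
    for node in level:
        if node not in has_out:
            removed = [(u, v) for u, v in links if u == node]
            # Next-level targets sort first (False sorts before True).
            removed.sort(key=lambda link: level[link[1]] != level[node] + 1)
            if removed:
                kept.add(removed[0])
    return kept


# Hypothetical example: D has three level-1 in-neighbors, threshold is 2.
level = {"s": 0, "A": 1, "B": 1, "C": 1, "D": 2, "E": 2}
links = [("s", "A"), ("s", "B"), ("s", "C"), ("A", "D"), ("B", "D"),
         ("C", "D"), ("C", "E"), ("A", "B"), ("D", "A")]
print(sorted(prune_links(links, level, d_in_thres=2)))
```

Note that step four can push a node's in-degree above d_in_thres
in this sketch, since it restores a link without re-checking the
target's in-degree; the text does not say how that case is handled.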


5.4 Security Properties 


This section provides a formal analysis of the security 
properties of SumUp assuming an expander graph. Vari- 
ous measurement studies have shown that social networks 
are indeed expander-like [13]. The link pruning optimiza- 
tion does not destroy a graph’s expander property because 
it preserves the level of each node in the original graph. 
Our analysis provides bounds on the expected attack 
capacity, C_A, and the expected fraction of votes collected
if Cmax honest users vote. The average-case analysis 
assumes that each attack edge is a random link in the 
graph. For personalized vote aggregation, the expectation 
is taken over all vote collectors which include all honest 
nodes. In the unfortunate but rare scenario where an ad- 
versarial node is close to the vote collector, we can use 
feedback to re-adjust link capacities (Section 6). 


Theorem 5.1 Given that the trust network G on n nodes is a
bounded degree expander graph, the expected capacity per attack
edge is $E(C_A)/e_A = 1 + O\left(\frac{C_{\max}}{n} \log C_{\max}\right)$,
which is $1 + o(1)$ if $C_{\max} = O(n^{\alpha})$ for $\alpha < 1$.
If $e_A \cdot C_{\max} \ll n$, the capacity per attack edge is
bounded by 1 with high probability.


Proof Sketch Let $L_i$ represent the number of nodes at level
$i$ with $L_0 = 1$. Let $E_i$ be the number of edges pointing
from level $i-1$ to level $i$. Notice that $E_i \ge L_i$. Let
$T_i$ be the number of tickets propagated from level $i-1$ to
$i$ with $T_0 = C_{\max}$. The number of tickets at each level is
reduced by the number of nodes at the previous level (i.e.
$T_i = T_{i-1} - L_{i-1}$). Therefore, the number of levels with
non-zero tickets is at most $O(\log C_{\max})$ as $L_i$ grows
exponentially in an expander graph. For a randomly placed attack
edge, the probability of its being at level $i$ is at most
$L_i/n$. Therefore, the expected capacity of a random attack
edge can be calculated as
$1 + \sum_i \left(\frac{L_i}{n} \cdot \frac{T_i}{E_i}\right) \le 1 + \sum_i \left(\frac{L_i}{n} \cdot \frac{C_{\max}}{L_i}\right) = 1 + O\left(\frac{C_{\max}}{n} \log C_{\max}\right)$.
Therefore, if $C_{\max} = O(n^{\alpha})$ for $\alpha < 1$, the
expected attack capacity per attack edge is $1 + o(1)$.

Since the number of nodes within the vote envelope is at most
$C_{\max}$, the probability of a random attack edge being located
outside the envelope is $1 - \frac{C_{\max}}{n}$. Therefore, the
probability that any of the $e_A$ attack edges lies within the
vote envelope is
$1 - \left(1 - \frac{C_{\max}}{n}\right)^{e_A} < \frac{e_A \cdot C_{\max}}{n}$.
Hence, if $e_A \cdot C_{\max} = n^{\alpha}$ where $\alpha < 1$,
the attack capacity is bounded by 1 with high probability.


Theorem 5.1 is for expected capacity per attack edge. In the
worst case, when the vote collector is adjacent to some
adversarial nodes, the attack capacity can be a significant
fraction of C_max. Such rare worst case scenarios are addressed
in Section 6.


Theorem 5.2 Given that the trust network G on n nodes is a
d-regular expander graph, the expected fraction of votes that
can be collected out of $C_{\max}$ honest voters is
$\frac{d - \lambda_2}{d}\left(1 - \frac{C_{\max}}{n}\right)$,
where $\lambda_2$ is the second largest eigenvalue of the
adjacency matrix of G.


Proof Sketch SumUp creates a vote envelope consisting of
$C_{\max}$ entry points via which votes are collected. To prove
that there exists a large fraction of vote flows, we argue that
the minimum cut of the graph between the set of $C_{\max}$ entry
points and an arbitrary set of $C_{\max}$ honest voters is large.

Expanders are well-connected graphs. In particular, the Expander
mixing lemma [19] states that for any sets $S$ and $T$ in a
d-regular expander graph, the expected number of edges between
$S$ and $T$ is $(d - \lambda_2)|S| \cdot |T|/n$, where
$\lambda_2$ is the second largest eigenvalue of the adjacency
matrix of G. Let $S$ be a set of nodes containing $C_{\max}$
entry points and $T$ be a set of nodes containing $C_{\max}$
honest voters, thus $|S| + |T| = n$ and $|S| \ge C_{\max}$,
$|T| \ge C_{\max}$. Therefore, the min-cut value between $S$ and
$T$ is
$(d - \lambda_2)|S| \cdot |T|/n \ge (d - \lambda_2) \cdot C_{\max}(n - C_{\max})/n$.
The number of vote flows between $S$ and $T$ is at least $1/d$
of the min-cut value because each vote flow only uses one of an
honest voter’s $d$ incoming links. Therefore, the fraction of
votes that can be collected is at least
$(d - \lambda_2) \cdot \frac{C_{\max}(n - C_{\max})}{n \cdot d \cdot C_{\max}} = \frac{d - \lambda_2}{d}\left(1 - \frac{C_{\max}}{n}\right)$.
For well-connected graphs like expanders, $\lambda_2$ is well
separated from $d$, so that a significant fraction of votes can
be collected.


5.5 Setting C_max adaptively


When n_v honest users vote on an object, SumUp should ideally
set C_max to be n_v in order to collect a large fraction of
honest votes on that object. In practice, n_v/n is very small
for any object, even a very popular one. Hence,
C_max = n_v ≪ n and the expected capacity per attack edge is 1.
We note that even if n_v ≈ n, the attack capacity is still
bounded by O(log n) per attack edge.

It is impossible to precisely calculate the number of honest
votes (n_v). However, we can use the actual number of votes
collected by SumUp as a lower bound estimate for n_v. Based on
this intuition, SumUp adaptively sets C_max according to the
number of votes collected for each object. The adaptation works
as follows: For a given object, SumUp starts with a small
initial value for C_max, e.g. C_max = 100. Subsequently, if the
number of actual votes collected exceeds p·C_max where p is a
constant less than 1, SumUp doubles the C_max in use and re-runs
the capacity assignment and vote collection procedures. The
doubling of C_max continues until the number of collected votes
becomes less than p·C_max.

We show that this adaptive strategy is robust, i.e. the maximum
value of the resulting $C_{\max}$ will not dramatically exceed
$n_v$ regardless of the number of bogus votes cast by
adversarial nodes. Since adversarial nodes attempt to cast
enough bogus votes to saturate attack capacity, the number of
votes collected is at most $n_v + C_A$ where
$C_A = e_A\left(1 + \frac{C_{\max}}{n} \log C_{\max}\right)$. The
doubling of $C_{\max}$ stops when the number of collected votes
is less than $p\,C_{\max}$. Therefore, the maximum value of
$C_{\max}$ that stops the adaptation is one that satisfies the
following inequality:

$$n_v + e_A\left(1 + \frac{C_{\max}}{n} \log C_{\max}\right) < p\,C_{\max}$$

Since $\log C_{\max} \le \log n$, the adaptation terminates with
$C'_{\max} = (n_v + e_A)/\left(p - \frac{e_A \log n}{n}\right)$.
As $p \gg \frac{e_A \log n}{n}$, we derive
$C'_{\max} = \frac{1}{p}(n_v + e_A)$. The adaptive strategy
doubles $C_{\max}$ every iteration, hence it overshoots by at
most a factor of two. Therefore, the resulting $C_{\max}$ found
is $C_{\max} = \frac{2}{p}(n_v + e_A)$. As we can see, the
attacker can only affect the $C_{\max}$ found by an additive
factor of $e_A$. Since $e_A$ is small, the attacker has
negligible influence on the $C_{\max}$ found.

The previous analysis is done for the expected case with random
attack edges. Even in a worst case scenario where some attack
edges are very close to the vote collector, the adaptive
strategy is still resilient against manipulation. In the worst
case scenario, the attack capacity is proportional to C_max,
i.e. C_A = x·C_max. Since no vote aggregation scheme can defend
against an attacker who controls a majority of immediate links
from the vote collector, we are only interested in the case
where x < 0.5. The adaptive strategy stops increasing C_max when
n_v + x·C_max < p·C_max, thus resulting in
C_max ≤ n_v/(p − x). As we can see, p must be greater than x to
prevent the attacker from causing SumUp to increase C_max to
infinity. Therefore, we set p = 0.5 by default.
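
The doubling procedure of this section can be illustrated with a
short Python sketch. The function `adapt_c_max`, its `limit`
guard, and the two simulated vote counters are hypothetical: a
real system would re-run capacity assignment and vote collection
at each step, whereas here the number of collected votes is
modeled (as an assumption) by min(n_v, C_max) honest votes plus a
saturated attack capacity.

```python
def adapt_c_max(collect_votes, p=0.5, initial=100, limit=2**30):
    """Double C_max until fewer than p * C_max votes are collected.

    `collect_votes(c_max)` stands in for re-running capacity assignment and
    vote collection. `limit` is a safety guard added for this sketch, in
    case p is not larger than the attacker's share x.
    """
    c_max = initial
    while c_max < limit and collect_votes(c_max) >= p * c_max:
        c_max *= 2
    return c_max


def expected_case(c_max, n_v=400, e_a=100):
    # Assumed model: attack capacity stays near e_A regardless of C_max.
    return min(n_v, c_max) + e_a


def worst_case(c_max, n_v=400):
    # Assumed model: the attacker holds a fraction x = 1/4 of C_max.
    return min(n_v, c_max) + c_max // 4


print(adapt_c_max(expected_case))
print(adapt_c_max(worst_case))
```

Under these assumed models the expected case stops below the bound
(2/p)(n_v + e_A) = 2000, and the worst case stops at 2·n_v/(p − x),
i.e. the bound n_v/(p − x) with the factor-of-two overshoot from
doubling.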





6 Leveraging user feedback 


The basic design presented in Section 5 does not address the
worst case scenario where C_A could be much higher than e_A.
Furthermore, the basic design only bounds the number of bogus
votes collected on a single object. As a result, adversaries can
still cast up to e_A bogus votes on every object in the system.
In this section, we utilize feedback to address both problems.

SumUp maintains a penalty value for each link and uses the
penalty in two ways. First, we adjust each link’s capacity
assignment so that links with higher penalties have lower
capacities. This helps reduce C_A when some attack edges happen
to be close to the vote collector. Second, we eliminate links
whose penalties have exceeded a certain threshold. Therefore, if
adversaries continuously misbehave, the attack capacity will
drop below e_A over time. We describe how SumUp calculates and
uses penalty in the rest of the section.


6.1 Incorporating negative feedback 


The vote collector can choose to associate negative feed- 
back with voters if he believes their votes are malicious. 
Feedback may be performed for a very small set of objects, for
example, when the collector finds out that an object is a bogus
file or a virus.

SumUp keeps track of a penalty value, p_i, for each link i in
the trust network. For each voter receiving negative feedback,
SumUp increments the penalty values for all links along the path
to that voter. Specifically, if the link being penalized has
capacity c_i, SumUp increments the link’s penalty by 1/c_i.
Scaling the increment by 1/c_i is intuitive; links with high
capacities are close to the vote collector and hence are more
likely to propagate some bogus votes even if they are honest
links. Therefore, SumUp imposes a lesser penalty on high
capacity links.

It is necessary to penalize all links along the path in- 
stead of just the immediate link to the voter because that 
voter might be a Sybil identity created by some other at- 
tacker along the path. Punishing a link to a Sybil identity 
is useless as adversaries can easily create more such links. 
This way of incorporating negative feedback is inspired 
by Ostra [18]. Unlike Ostra, SumUp uses a customized 
flow network per vote collector and only allows the collector
to incorporate feedback for its associated network
in order to ensure that feedback is always trustworthy. 
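
As a minimal sketch of the penalty update described in this
section (the function name and the dictionary representation of
links are assumptions for the example), each link on the path to
a penalized voter gains a penalty of 1/c_i:

```python
def apply_negative_feedback(penalty, capacity, path):
    """Add 1/c_i to the penalty of every link i on the path to a bad voter."""
    for link in path:
        penalty[link] = penalty.get(link, 0.0) + 1.0 / capacity[link]
    return penalty


# Hypothetical path s -> A -> X, where X cast a vote judged malicious.
capacity = {("s", "A"): 4, ("A", "X"): 1}
penalty = {}
apply_negative_feedback(penalty, capacity, [("s", "A"), ("A", "X")])
print(penalty)
```

The high-capacity link near the collector accumulates penalty four
times more slowly than the capacity-one link next to the voter.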


6.2 Capacity adjustment 


The capacity assignment in Section 5.1 lets each node dis- 
tribute incoming tickets evenly across all outgoing links. 
In the absence of feedback, it is reasonable to assume that 
all outgoing links are equally trustworthy and hence to 
assign them the same number of tickets. When negative 
feedback is available, a node should distribute fewer tick- 
ets to outgoing links with higher penalty values. Such ad- 
justment is particularly useful in circumstances where ad- 
versaries are close to the vote collector and hence might 
receive a large number of tickets. 

The goal of capacity adjustment is to compute a weight,
$w(p_i)$, as a function of the link’s penalty. The number of
tickets a node distributes to its outgoing link $i$ is
proportional to the link’s weight, i.e.
$t_i = t_{out} \cdot w(p_i) / \sum_{j \in nbrs} w(p_j)$. The
question then becomes how to compute $w(p_i)$. Clearly, a link
with a high penalty value should have a smaller weight, i.e.
$w(p_i) < w(p_j)$ if $p_i > p_j$. Another desirable property is
that if the penalties on two links increase by the same amount,
the ratio of their weights remains unchanged. In other words,
the weight function should satisfy:
$\forall p', p_i, p_j:\ \frac{w(p_i)}{w(p_j)} = \frac{w(p_i + p')}{w(p_j + p')}$.
This requirement matches our intuition that if two links have
accumulated the same amount of additional penalties over a
period of time, the relative capacities between them should
remain the same. Since the exponential function satisfies both
requirements, we use $w(p_i) = 0.2^{p_i}$ by default.
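
The weighted split can be illustrated as follows. This is a
sketch under assumptions: the function names are hypothetical,
and since the text does not say how fractional tickets are
rounded, the shares are left as floating-point values.

```python
def weight(penalty, base=0.2):
    """Exponential weight w(p) = base ** p, with the default base 0.2."""
    return base ** penalty


def split_tickets(t_out, penalties):
    """Split t_out tickets across outgoing links in proportion to w(p_i)."""
    weights = {link: weight(p) for link, p in penalties.items()}
    total = sum(weights.values())
    return {link: t_out * w / total for link, w in weights.items()}


# Hypothetical node with four outgoing links, one carrying a penalty of 1.
shares = split_tickets(3200, {"a": 0.0, "b": 0.0, "c": 0.0, "d": 1.0})
print(shares)
```

With three unpenalized links and one link of penalty 1, the
penalized link receives 0.2/3.2 of the tickets instead of 1/4.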





6.3 Eliminating links using feedback


Capacity adjustment cannot reduce the attack capacity to below
e_A since each link is assigned a minimum capacity value of one.
To further reduce e_A, we eliminate those links that received
high amounts of negative feedback.

Network        | Nodes (x1000) | Edges (x1000) | Degree 50% (90%)
YouTube [18]   |               | 3,458         | 2 (12)
Flickr [17]    |               |               |
Synthetic [24] | 3000          | 24,248        | 6 (15)

Table 1: Statistics of the social network traces or synthetic
model used for evaluating SumUp. All statistics are for the
strongly connected component (SCC).

We use a heuristic for link elimination: we remove a link if its
penalty exceeds a threshold value. We use a default threshold of
five. Since we already prune the trust network (Section 5.3)
before performing capacity assignment, we add back a previously
pruned link if one exists after eliminating an incoming link.
The reason why link elimination is useful can be explained
intuitively: if adversaries continuously cast bogus votes on
different objects over time, all attack edges will be eliminated
eventually. On the other hand, although an honest user might
have one of its incoming links eliminated because of a
downstream attacker casting bad votes, he is unlikely to
experience another elimination due to the same attacker since
the attack edge connecting him to that attacker has also been
eliminated. Despite this intuitive argument, there always exist
pathological scenarios where link elimination affects some
honest users, leaving them with no voting power. To address such
potential drawbacks, we reinstate eliminated links at a slow
rate over time. We evaluate the effect of link elimination in
Section 7.
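
The elimination heuristic of Section 6.3 can be sketched as
below. The function name and data layout are hypothetical, and
the sketch reads "add back a previously pruned link" as restoring
one pruned incoming link of the node that lost a link, which is
an assumption; the slow reinstatement of eliminated links over
time is not modeled.

```python
def eliminate_links(active, pruned, penalty, threshold=5.0):
    """Remove links whose penalty exceeds the threshold.

    For each removed link u -> v, one previously pruned incoming link of v
    is restored if available. Returns (active, pruned, eliminated) as sets.
    """
    active, pruned = set(active), set(pruned)
    eliminated = {link for link in active if penalty.get(link, 0.0) > threshold}
    active -= eliminated
    for _, node in sorted(eliminated):
        spares = sorted(link for link in pruned if link[1] == node)
        if spares:
            pruned.remove(spares[0])
            active.add(spares[0])
    return active, pruned, eliminated


# Hypothetical state: link A -> X has accumulated a penalty above five.
active = {("s", "A"), ("A", "X")}
pruned = {("B", "X")}
penalty = {("s", "A"): 1.0, ("A", "X"): 5.25}
print(eliminate_links(active, pruned, penalty))
```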


7 Evaluation 


In this section, we demonstrate SumUp’s security prop- 
erty using real-world social networks and voting traces. 
Our key results are: 

1. For all networks under evaluation, SumUp bounds 
the average number of bogus votes collected to be no 
more than e_A while being able to collect >90% of
honest votes when less than 17% of honest users vote. 

2. By incorporating feedback from the vote collector, 
SumUp dramatically cuts down the attack capacity 
for adversaries that continuously cast bogus votes. 

3. We apply SumUp to the voting trace and social net- 
work of Digg [1], a news aggregation site that uses 
votes to rank user-submitted news articles. SumUp 
has detected hundreds of suspicious articles that have 
been marked as “popular” by Digg. Based on man- 
ual sampling, we believe at least 50% of suspicious 
articles found by SumUp exhibit strong evidence of 
Sybil attacks. 


7.1 Experimental Setup 


For the evaluation, we use a number of network datasets 
from different online social networking sites [17] as well 
as a synthetic social network [24] as the underlying trust 
network. SumUp works for different types of trust net- 
works as long as an attacker cannot obtain many attack 
edges easily in those networks. Table 1 gives the statistics
of various datasets. For undirected networks, we treat
each link as a pair of directed links. Unless explicitly men- 
tioned, we use the YouTube network by default. 

To evaluate the Sybil-resilience of SumUp, we inject e_A = 100
attack edges by adding 10 adversarial nodes, each with links
from 10 random honest nodes in the network. The attacker always
casts the maximum number of bogus votes to saturate his
capacity. Each experimental run involves a randomly chosen vote
collector and a subset of nodes which serve as honest voters.
SumUp adaptively adjusts C_max using an initial value of 100 and
p = 0.5. By default, the threshold of allowed non-greedy steps
is 20. We plot the average statistic across five experimental
runs in all graphs. In Section 7.6, we apply SumUp on the
real-world voting trace of Digg to examine how SumUp can be used
to resist Sybil attacks in the wild.
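
The attack-edge injection used in this setup can be sketched as
follows. The function name, the adversary naming scheme, and the
fixed random seed are assumptions for the example; only the
counts (10 adversarial nodes, 10 links each from random honest
nodes) come from the text.

```python
import random


def inject_attack_edges(honest_nodes, links, n_adversaries=10,
                        edges_each=10, seed=0):
    """Add adversarial nodes, each linked from random distinct honest nodes.

    Returns (all links, attack edges); attack edges point from an honest
    node to an adversarial node.
    """
    rng = random.Random(seed)
    attack_edges = set()
    for k in range(n_adversaries):
        adversary = f"adv{k}"  # hypothetical naming scheme
        for honest in rng.sample(honest_nodes, edges_each):
            attack_edges.add((honest, adversary))
    return set(links) | attack_edges, attack_edges


honest = list(range(1000))
all_links, attack = inject_attack_edges(honest, links=[(0, 1), (1, 2)])
print(len(attack))
```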
[Plot: average capacity of an attack edge (C_A/e_A) versus
number of honest voters / total nodes.]

Figure 4: The average capacity per attack edge as a function of
the fraction of honest nodes that vote. The average capacity per
attack edge remains close to 1, even if 1/10 of honest nodes
vote.


7.2  Sybil-resilience of the basic design 


The main goal of SumUp is to limit attack capacity while 
allowing honest users to vote. Figure 4 shows that the 
average attack capacity per attack edge remains close to 
1 even when the number of honest voters approaches 
10%. Furthermore, as shown in Figure 5, SumUp man- 
ages to collect more than 90% of all honest votes in all 
networks. Link pruning is disabled in these experiments. 
The three networks under evaluation have very different 
sizes and degree distributions (see Table 1). The fact that 
all three networks exhibit similar performance suggests 
that SumUp is robust against topological details. Since SumUp
adaptively sets C_max in these experiments, the results also
confirm that adaptation works well in finding a C_max that can
collect most of the honest votes without significantly
increasing attack capacity. We point out that the results in
Figure 4 correspond to a random vote collector. An unlucky vote
collector close to an attack edge may experience a much larger
than average attack capacity. In personalized vote collection,
there are few unlucky collectors. These unlucky vote collectors
need to use their own feedback on bogus votes to reduce attack
capacity.


Benefits of pruning: The link pruning optimization, introduced
in Section 5.3, further reduces the attack capacity by capping
the number of attack edges an adversarial node can have. As
Figure 6 shows, pruning does not affect the fraction of honest
votes collected if the threshold d_in_thres is greater than 3.
Figure 6 represents data from the YouTube network and the
results for other networks are similar. SumUp uses the default
threshold (d_in_thres) of 3. Figure 7 shows that the average
attack capacity is greatly reduced when adversarial nodes have
more than 3 attack edges. Since pruning attempts to restrict
each node to at most 3 incoming links, additional attack edges
are excluded from vote flow computation.

Figure 5: The fraction of votes collected as a function of the
fraction of honest nodes that vote. SumUp collects more than 80%
of votes, even when 1/10 of honest nodes vote.

Figure 6: The fraction of votes collected for different
d_in_thres (YouTube graph). More than 90% of votes are collected
when d_in_thres = 3.

Figure 7: Average attack capacity per attack edge decreases as
the number of attack edges per adversary increases.

Figure 8: The fraction of votes collected for different
thresholds for non-greedy steps. More than 70% of votes are
collected even with a small threshold (10) for non-greedy steps.

Figure 9: The running time of one vote collector gathering up to
1000 votes. The Ford-Fulkerson max-flow algorithm takes 50
seconds to collect 1000 votes for the YouTube graph.


7.3. Effectiveness of greedy search 


SumUp uses a fast greedy algorithm to calculate approx- 
imate max vote flows to voters. Greedy search enables 
SumUp to collect a majority of votes while using a small 
threshold (t) of non-greedy steps. Figure 8 shows the frac- 
tion of honest votes collected for the pruned YouTube 
graph. As we can see, with a small threshold of 20, the 
fraction of votes collected is more than 80%. Even when 
disallowing non-greedy steps completely, SumUp man- 
ages to collect > 40% of votes. 


Figure 9 shows the running time of greedy search for different
networks. The experiments are performed on a single machine with
an AMD Opteron 2.5GHz CPU and 8GB memory. SumUp takes around 5ms
to collect 1000 votes from a single vote collector on YouTube
and Flickr. The synthetic network incurs more running time as
its links are more congested than those in YouTube and Flickr.
The average number of non-greedy steps taken in the synthetic
network is 6.5 as opposed to 0.8 for the YouTube graph. Greedy
search dramatically reduces the flow computation time. As a
comparison, the Ford-Fulkerson max-flow algorithm requires 50
seconds to collect 1000 votes for the YouTube graph.

Figure 10: Average attack capacity per attack edge as a function
of voters. SumUp is better than SybilLimit in the average case.


7.4 Comparison with SybilLimit 


SybilLimit is a node admission protocol that leverages the 
trust network to allow an honest node to accept other hon- 
est nodes with high probability. It bounds the number of 
Sybil nodes accepted to be O(log n). We can apply Sybil- 
Limit for vote aggregation by letting each vote collector 
compute a fixed set of accepted users based on the trust 
network. Subsequently, a vote is collected if and only if it 
comes from one of the accepted users. In contrast, SumUp 
does not calculate a fixed set of allowed users; rather, it 
dynamically determines the set of voters that count toward 
each object. Such dynamic calculation allows SumUp to settle on
a small C_max while still collecting most of the honest votes.
A small C_max allows SumUp to bound attack capacity by e_A.


Figure 10 compares the average attack capacity in 
SumUp to that of SybilLimit for the un-pruned YouTube 
network. The attack capacity in SybilLimit refers to the 
number of Sybil nodes that are accepted by the vote col- 
lector. Since SybilLimit aims to accept nodes instead of 
votes, its attack capacity remains O(log n) regardless of 
the number of actual honest voters. Our implementation 
of SybilLimit uses the optimal set of parameters (w = 15, 
r = 3000) we determined manually. As Figure 10 shows, 
while SybilLimit allows 30 bogus votes per attack edge, 
SumUp results in approximately 1 vote per attack edge 
when the fraction of honest voters is less than 10%. When 
all nodes vote, SumUp leads to much lower attack ca- 
pacity than SybilLimit even though both have the same 
O(log n) asymptotic bound per attack edge. This is due 
to two reasons. First, SumUp’s bound of 1 + log n in
Theorem 5.1 is a loose upper bound of the actual average
capacity. Second, since links pointing to lower-level nodes are
not eligible for ticket distribution, many incoming links of an
adversarial node have zero tickets and thus are assigned a
capacity of one.


[Plot: attack capacity (left axis) and fraction of honest votes
collected (right axis) versus timestep, 0 to 30.]

Figure 11: The change in attack capacity as adversaries
continuously cast bogus votes (YouTube graph). Capacity
adjustment and link elimination dramatically reduce C_A while
still allowing SumUp to collect more than 80% of the honest
votes.


7.5 Benefits of incorporating feedback 


We evaluate the benefits of capacity adjustment and link
elimination when the vote collector provides feedback on the
bogus votes collected. Figure 11 corresponds to the worst case
scenario where one of the vote collector’s four outgoing links
is an attack edge. At every time step, there are 400 random
honest users voting on an object and the attacker also votes
with its maximum capacity. When collecting votes on the first
object at time step 1, adaptation results in
C_max = 2·n_v/(p − x) = 3200 because n_v = 400, p = 0.5,
x = 1/4. Therefore, the attacker manages to cast
(1/4)·C_max = 800 votes and outvote honest users. After
incorporating the vote collector’s feedback after the first time
step, the adjacent attack edge incurs a penalty of 1, which
results in a drastically reduced C_A (97). If the vote collector
continues to provide feedback on malicious votes, 90% of attack
edges are eliminated after only 12 time steps. After another 10
time steps, all attack edges are eliminated, reducing C_A to
zero. However, because of our decision to slowly add back
eliminated links, the attack capacity does not remain at zero
forever. Figure 11 also shows that link elimination has little
effect on honest nodes as the fraction of honest votes collected
always remains above 80%.


7.6 Defending Digg against Sybil attacks 


In this section, we ask the following questions: Is there 
evidence of Sybil attacks in real world content voting sys- 
tems? Can SumUp successfully limit bogus votes from 
Sybil identities? We apply SumUp to the voting trace and 
social network crawled from Digg to show the real world 
benefits of SumUp. 

Digg [1] is a popular news aggregation site where any 
registered user can submit an article for others to vote on. 
A positive vote on an article is called a digg. A negative 
vote is called a bury. Digg marks a subset of submitted ar- 
ticles as “popular” articles and displays them on its front 
page. In subsequent discussions, we use the terms pop- 
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Number of Nodes 3,002,907 
Number of Edges 5,063,244 
Number of Nodes in SCC 466,326 
Number of Edges in SCC 4,908,958 
Out degree avg(50%, 90%) 10C1, 9) 
In degree avg(50%, 90%) 10(2, 11) 
Number of submitted (popular) articles 6,494,987 
2004/12/01-2008/09/21 (137,480) 
Diggs on all articles 

ave(50%, 90%) 24(2, 15) 


Diggs on popular articles 

ave(50%, 90%) 

Hours since submission before a popular 
article is marked as popular. 


862(650, 1810) 


avg (50,%,90%) 16(13, 23) 
Number of submitted (popular) articles 38,033 
with bury data available (5,794) 


2008/08/13-2008/09/15 


Table 2: Basic statistics of the crawled Digg dataset. The 


strongly connected component (SCC) of Digg consists of 
466,326 nodes. 


Figure 12: Distribution of diggs for all popular articles before
being marked as popular and for all articles within 24 hours after
submission.


ular or popularity only to refer to the popularity status 
of an article as marked by Digg. A Digg user can cre- 
ate a “follow” link to another user if he wants to browse 
all articles submitted by that user. We have crawled Digg 
to obtain the voting trace on all submitted articles since 
Digg’s launch (2004/12/01-2008/09/21) as well as the 
complete “follow” network between users. Unfortunately, 
unlike diggs, bury data is only available as a live stream. 
Furthermore, Digg does not reveal the user identity that 
cast a bury, preventing us from evaluating SumUp’s feed- 
back mechanism. We have been streaming bury data since 
2008/08/13. Table 2 shows the basic statistics of the Digg 
“follow” network and the two voting traces, one with bury 
data and one without. Although the strongly connected 
component (SCC) consists of only 15% of total nodes, 
88% of votes come from nodes in the SCC. 


There is enormous incentive for an attacker to get a sub- 
mitted article marked as popular, thus promoting it to the 
Figure 13: The distribution of the fraction of diggs collected by
SumUp over all diggs before an article is marked as popular. 


front page of Digg which has several million page views 
per day. Our goal is to apply SumUp on the voting trace 
to reduce the number of successful attacks on the popu- 
larity marking mechanism of Digg. Unfortunately, unlike 
experiments done in Section 7.2 and Section 7.5, there is 
no ground truth about which Digg users are adversaries. 
Instead, we have to use SumUp itself to find evidence of 
attacks and rely on manual sampling and other types of 
data to cross check the correctness of results. 


Digg’s popularity ranking algorithm is intentionally not 
revealed to the public in order to mitigate gaming of the 
system. Nevertheless, we speculate that the number of 
diggs is a top contributor to an article’s popularity status. 
Figure 12 shows the distribution of the number of diggs 
an article received before it was marked as popular. Since 
more than 90% of popular articles are marked as such 
within 24 hours after submission, we also plot the number 
of diggs received within 24 hours of submission for all ar- 
ticles. The large difference between the two distributions 
indicates that the number of diggs plays an important role 
in determining an article’s popularity status. 


Instead of simply adding up the actual number of diggs, 
what if Digg uses SumUp to collect all votes on an article? 
We use the identity of Kevin Rose, the founder of Digg, 
as the vote collector to aggregate all diggs on an article 
before it is marked as popular. Figure 13 shows the distri- 
bution of the fraction of votes collected by SumUp over 
all diggs before an article is marked as popular. Our pre- 
vious evaluation on various network topologies suggests 
that SumUp should be able to collect at least 90% of all 
votes. However, in Figure 13, there are a fair number of 
popular articles with much fewer than the expected fraction
of diggs collected. For example, SumUp manages to collect
less than 50% of votes for 0.5% of popular articles. We
hypothesize that SumUp collects fewer votes than expected
for these articles because of real-world Sybil attacks.


Since there is no ground truth data to verify whether 
Threshold of the fraction of collected diggs
# of suspicious articles
Advertisement
Phishing | 1 | 0 | 0 | 0
Obscure political articles | 2 | 2 | 0 | 0
Many newly registered voters | 11 | 7 | 8 | 10
Fewer than 50 total diggs | 1 | 3 | 6 | 4
No obvious attack 15

Table 3: Manual classification of 30 randomly sampled suspicious
articles. We use different thresholds of the fraction of collected
diggs for marking suspicious articles. An article is labeled
as having many new voters if > 30% of its votes are from users
who registered on the same day as the article’s submission date.


few collected diggs are indeed the result of attacks, we 
resort to manual inspection. We classify a popular article 
as suspicious if its fraction of diggs collected is less than 
a given threshold. Table 3 shows the result of manually 
inspecting 30 random articles out of all suspicious articles.
The random samples for different thresholds are chosen
independently. There are a number of obviously bogus
articles such as advertisements, phishing articles, and
obscure political opinions. Of the remaining articles, many
have an unusually large fraction (>30%) of new voters who
registered on the same day the article was submitted.
Some articles also have very few total diggs
since becoming popular, a rare event since an article typically
receives hundreds of votes after being shown on the
front page of Digg. We find no obvious evidence of attack
for roughly half of the sampled articles. Interviews
with Digg attackers [10] reveal that, although there is a
fair amount of attack activity on Digg, attackers do not
usually promote obviously bogus material. This is likely
due to Digg being a highly monitored system with fewer 
than a hundred articles becoming popular every day. In- 
stead, attackers try to help paid customers promote nor- 
mal or even good content or to boost their profiles within 
the Digg community. 
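
The two labeling rules used in this inspection (an article is suspicious if its fraction of collected diggs is below a threshold, and it has many new voters if more than 30% of its voters registered on the submission day) can be sketched in a few lines of Python. This is only an illustrative sketch: the function names and the input layout are assumptions made for this example, not part of the analysis pipeline used for Table 3.

```python
from datetime import date

def is_suspicious(collected_diggs, total_diggs, threshold):
    """True if the fraction of diggs collected is below the threshold."""
    if total_diggs == 0:
        return False  # no votes, nothing to judge
    return collected_diggs / total_diggs < threshold

def has_many_new_voters(voter_registration_dates, submission_date, cutoff=0.30):
    """True if more than `cutoff` of the voters registered on the submission date."""
    if not voter_registration_dates:
        return False
    same_day = sum(1 for d in voter_registration_dates if d == submission_date)
    return same_day / len(voter_registration_dates) > cutoff
```

An article that passes `is_suspicious` for a given threshold would then be a candidate for manual inspection; the 30% cutoff is the one stated in the caption of Table 3.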


As further evidence that a lower than expected fraction 
of collected diggs signals a possible attack, we examine 
Digg’s bury data for articles submitted after 2008/08/13, 
of which 5794 are marked as popular. Figure 14 plots the 
correlation between the average number of bury votes on 
an article after it became popular vs. the fraction of the 
diggs SumUp collected before it was marked as popular. 
As Figure 14 reveals, the higher the fraction of diggs col- 
lected by SumUp, the fewer bury votes an article received 
after being marked as popular. Assuming most bury votes 
come from honest users that genuinely dislike the article, 
a large number of bury votes is a good indicator that the 
article is of dubious quality. 
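
One plausible way to produce a curve like the one in Figure 14 is to group articles into equal-width bins by the fraction of diggs collected and average the bury counts in each bin. The sketch below assumes this binning approach and a hypothetical input of (fraction, buries) pairs; the actual binning used for the figure is not specified here.

```python
def average_buries_by_fraction(articles, num_bins=5):
    """Average bury count per bin of the collected-digg fraction.

    articles: iterable of (fraction_collected, buries) pairs with
    fraction_collected in [0, 1]. Returns one average per bin,
    or None for a bin that received no articles.
    """
    sums = [0.0] * num_bins
    counts = [0] * num_bins
    for fraction, buries in articles:
        # A fraction of exactly 1.0 is placed in the last bin.
        index = min(int(fraction * num_bins), num_bins - 1)
        sums[index] += buries
        counts[index] += 1
    return [s / c if c else None for s, c in zip(sums, counts)]
```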


What are the voting patterns for suspicious articles? 


Figure 14: The average number of buries an article received
after it was marked as popular as a function of the fraction of
diggs collected by SumUp before it is marked as popular. The
figure covers 5,794 popular articles with bury data available.


Since 88% of diggs come from nodes within the SCC, we
expect only 12% of diggs to originate from the rest of the
network, which mostly consists of nodes with no incoming
follow links. For most suspicious articles, SumUp collects
fewer diggs than expected because an unusually large
fraction of votes comes from outside the SCC. Since Digg’s
popularity marking algorithm is not known, attackers might
not bother to connect their Sybil identities to the SCC or
to each other.
Interestingly, we found 5 suspicious articles with sophis- 
ticated voting patterns where one voter is linked to many 
identities (~ 30) that also vote on the same article. We be- 
lieve the many identities behind that single voter are likely 
Sybil identities because those identities were all created 
on the same day as the article’s submission. Additionally, 
those identities all have similar usernames. 


8 SumUp in a Decentralized Setting 


Even though SumUp is presented in a centralized setup 
such as a content-hosting Web site, it can also be imple- 
mented in a distributed fashion in order to rank objects 
in peer-to-peer systems. We outline one such distributed 
design for SumUp. In the peer-to-peer environment, each 
node and its corresponding user is identified by a self- 
generated public key. A pair of users create a trust link 
relationship between them by signing the trust statement 
with their private keys. Nodes gossip with each other or 
perform a crawl of the network to obtain a complete trust 
network between any pair of public keys. This is differ- 
ent from Ostra [18] and SybilLimit [26] which address 
the harder problem of decentralized routing where each 
user only knows about a small neighborhood around him- 
self in the trust graph. In the peer-to-peer setup, each user 
naturally acts as his own vote collector to aggregate votes 
and compute a personalized ranking of objects. To obtain 
all votes on an object, a node can either perform flooding 
(as in Credence [25]) or retrieve votes stored in a distributed
hash table. In the latter case, it is important that
the DHT itself be resilient against Sybil attacks. Recent 
work on Sybil-resilient DHTs [5, 14] addresses this chal- 
lenge. 


9 Conclusion 


This paper presented SumUp, a content voting system 
that leverages the trust network among users to defend 
against Sybil attacks. By using the technique of adaptive 
vote flow aggregation, SumUp aggregates a collection of 
votes with strong security guarantees: with high proba- 
bility, the number of bogus votes collected is bounded 
by the number of attack edges while the number of hon- 
est votes collected is high. We demonstrate the real-world 
benefits of SumUp by evaluating it on the voting trace of 
Digg: SumUp detected many suspicious articles marked 
as “popular” by Digg. We have found strong evidence of 
Sybil attacks on many of these suspicious articles. 
Abstract: ISPs are increasingly reluctant to collect
and store raw network traces because they can be used 
to compromise their customers’ privacy. Anonymization 
techniques mitigate this concern by protecting sensitive 
information. Trace anonymization can be performed of- 
fline (at a later time) or online (at collection time). Of- 
fline anonymization suffers from privacy problems be- 
cause raw traces must be stored on disk — until the traces 
are deleted, there is the potential for accidental leaks or 
exposure by subpoenas. Online anonymization drasti- 
cally reduces privacy risks but complicates software en- 
gineering efforts because trace processing and anony- 
mization must be performed at line speed. This paper 
presents Bunker, a network tracing system that combines 
the software development benefits of offline anonymiz- 
ation with the privacy benefits of online anonymization. 
Bunker uses virtualization, encryption, and restricted I/O 
interfaces to protect the raw network traces and the trac- 
ing software, exporting only an anonymized trace. We 
present the design and implementation of Bunker, eval- 
uate its security properties, and show its ease of use for 
developing a complex network tracing application. 


1 Introduction 


Network tracing is an indispensable tool for many 
network management tasks. Operators need network 
traces to perform routine network management opera- 
tions, such as traffic engineering [19], capacity plan- 
ning [38], and customer accounting [15]. Several re- 
search projects have proposed using traces for even more 
sophisticated network management tasks, such as diag- 
nosing faults and anomalies [27], recovering from se- 
curity attacks [45], or identifying unwanted traffic [9]. 
Tracing is also vital to networking researchers. As net- 
works and applications grow increasingly complex, un- 
derstanding the behavior of such systems is harder than 
ever. Gathering network traces helps researchers guide 
the design of future networks and applications [42, 49]. 

Customer privacy is a paramount concern for all on- 
line businesses, including ISPs, search engines, and e- 
commerce sites. Many ISPs view possessing raw net- 
work traces as a liability: such traces sometimes end up 
compromising their customers’ privacy through leaks or 
subpoenas. These concerns are real: the RIAA has sub- 
poenaed ISPs to reveal customer identities when pursu- 
ing cases of copyright infringement [16]. Privacy con- 
cerns go beyond subpoenas, however. Oversights or er- 
rors in preparing and managing network trace and server 
log files can seriously compromise users’ privacy by dis- 


closing social security numbers, names, addresses, or 
telephone numbers [5, 54]. 


Trace anonymization is the most common technique 
for addressing these privacy concerns. A typical imple- 
mentation uses a keyed one-way secure hash function to 
obfuscate sensitive information contained in the trace. 
This could be as simple as transforming a few fields in 
the IP headers, or as complex as performing TCP connec- 
tion reconstruction and then obfuscating data (e.g., email 
addresses) deep within the payload. There are two cur- 
rent approaches to anonymizing network traces: offline 
and online. Offline anonymization collects and stores 
the entire raw trace and then performs anonymization 
as a post-processing step. Online anonymization is done
on-the-fly by extracting and anonymizing sensitive infor- 
mation before it ever reaches the disk. In practice, both 
methods have serious shortcomings that make network 
trace collection increasingly difficult for network opera- 
tors and researchers. 
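
As an illustration of the keyed one-way hash mentioned above, the following Python sketch maps a sensitive field to a pseudonym. The choice of HMAC-SHA256, the function name, and the example IP address are assumptions made for this sketch; the text above does not prescribe a particular hash construction.

```python
import hashlib
import hmac
import secrets

def anonymize_field(key: bytes, value: str) -> str:
    """Map a sensitive value to a fixed-length pseudonym with HMAC-SHA256."""
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()

# Hypothetical anonymization key, generated once per trace.
key = secrets.token_bytes(32)
pseudonym = anonymize_field(key, "192.0.2.17")
```

Because the mapping is deterministic under a fixed key, the same address always yields the same pseudonym, so flows can still be correlated within a trace; without the key, the mapping cannot be recomputed.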


Offline anonymization poses risks to customer privacy 
because of how raw network traces are stored. These 
risks are growing more severe because of the need to look 
“deeper” into packet payloads, revealing more sensitive 
information. Current privacy trends make it unlikely that 
ISPs will continue to accept the risks associated with of- 
fline anonymization. We have first-hand experience with 
tracing Web, P2P, and e-mail traffic at two universities. 
In both cases the universities deemed the privacy risks as- 
sociated with offline anonymization to be unacceptable. 


While online anonymization offers much stronger pri- 
vacy benefits, it is very difficult to deploy in practice be- 
cause it creates significant software engineering issues. 
Any portion of the trace analysis that requires access to 
sensitive data must be performed on-the-fly and at a rate 
that can handle the network’s peak throughput. This is 
practical for simple tracing applications that analyze only 
IP and TCP headers; however, it is much more difficult 
for tracing applications that require deep packet inspec- 
tion. Developing complex online tracing software there- 
fore poses a significant challenge. Developers are limited 
in their selection of software: adopting garbage-collected 
(e.g., Java, C#) and dynamic scripting (e.g., Python, Perl) 
languages can be difficult; reusing existing libraries (e.g., 
HTML parsers or regexp engines) may also be hard if 
their implementation choices are incompatible with per- 
formance requirements. A network tracing experiment 
illustrates the performance challenges of online tracing. 
Our goal was to run hundreds of regular expressions to
identify phishing Web forms. However, an Intel 3.6GHz
processor running just one of these regular expressions
(using the off-the-shelf “libpcre” regexp library) could
handle less than 50 Mbps of incoming traffic.

This paper presents Bunker, a network tracing sys- 
tem built and deployed at the University of Toronto. 
Bunker offers the software development benefits of of- 
fline anonymization and the privacy benefits of online 
anonymization. Our key insight is that we can use the 
buffer-on-disk approach of offline anonymization if we 
can “lock down” the trace files and trace analysis soft- 
ware. This approach lets Bunker avoid all the software 
engineering downsides of online trace analysis. To im- 
plement Bunker, we use virtual machines, encryption, 
and restriction of I/O device configuration to construct a 
closed-box environment; Bunker requires no specialized 
hardware (e.g., a Trusted Platform Module (TPM) or a 
secure co-processor) to provide its security guarantees.
The trace analysis and anonymization software is pre- 
loaded into a closed-box VM before any raw trace data 
is gathered. Bunker makes it difficult for network opera- 
tors to interact with the tracing system or to access its in- 
ternal state once it starts running and thereby protects the 
anonymization key, the tracing software, and the raw net- 
work trace files inside the closed-box environment. The 
closed-box environment produces an anonymized trace 
as its only output. 

To protect against physical attacks (e.g., hardware 
tampering), we design Bunker to be safe-on-reboot: 
upon a reboot, all sensitive data gathered by the system 
is effectively destroyed. This property makes physical 
attacks more difficult because the attacker must tamper 
with Bunker’s hardware without causing a reboot. While 
a small class of physical attacks remains feasible (e.g., 
cold boot attacks [21]), in our experience ISPs find the 
privacy benefits offered by a closed-box environment that 
is safe-on-reboot a significant step forward. Although the 
system cannot stop ISPs from being subject to wiretaps, 
Bunker helps protect ISPs against the privacy risks inher- 
ent in collecting and storing network traces. 

Bunker’s privacy properties come at a cost. Bunker 
requires the network operator to pre-plan what data to 
collect and how to anonymize it before starting to trace 
the network. Bunker prevents anyone from changing the 
configuration while tracing; it can be reconfigured only 
through a reboot that will erase all sensitive data. 

The remainder of this paper describes Bunker’s threat 
model (Section 2), design goals and architecture (Sec- 
tion 3), as well as the benefits of Bunker’s architecture 
(Section 4). It then analyzes Bunker’s security proper- 
ties when confronted with a variety of attacks (Section 
5), describes operational issues (Section 6), and evaluates
Bunker’s software engineering benefits by examining
a tracing application (phishing analysis) built by one
student in two months that leverages off-the-shelf com- 
ponents and scripting languages (Section 7). The paper’s 
final sections review legal issues posed by Bunker’s ar- 
chitecture (Section 8) and related work (Section 9). 


2 Threat Model 


This section outlines the threat model for network 
tracing systems. We present five classes of attacks and 
discuss how Bunker addresses each. 


2.1 Subpoenas For Network Traces 


ISPs are discovering that traces gathered for diagnos- 
tic and research purposes can be used in court proceed- 
ings against their customers. As a result, they may view 
the benefits of collecting network traces as being out- 
weighed by the liability of possessing such information. 
Once a subpoena has been issued, an ISP must cooperate 
and reveal the requested information (e.g., traces or en- 
cryption keys) as long as the cooperation does not pose 
an undue burden. Consequently, a raw trace is protected 
against a subpoena only if no one has access to it or to 
the encryption and anonymization keys used to protect it. 

Our architecture was designed to collect traces while 
preserving user privacy even if a court permits a third 
party to have full access to the system. Once a Bunker 
trace has been initiated, all sensitive information is pro- 
tected from the system administrator in the same way it is 
protected from any adversary. Thus, our solution makes 
it a hardship for the ISP to surrender sensitive infor- 
mation. We eliminate potential downsides to collecting 
traces for legitimate purposes but do not prevent those 
with legal wiretap authorization from installing their own 
trace collection system. 


2.2 Accidental Disclosure 


ISPs face another risk, that of accidental disclosure of 
sensitive information from a network trace. History has 
shown that whenever people handle sensitive data, the 
danger of accidental disclosure is substantial. For exam- 
ple, the British Prime Minister recently had to publicly 
apologize when a government agency accidentally lost 
25 million child benefit records containing names and 
bank details because the agency did not follow the cor- 
rect procedure for sending these records by courier [5]. 
Bunker vastly reduces the risk that sensitive data will be 
accidentally released or stolen because no human can ac- 
cess the unanonymized trace. 


2.3 Remote Attacks Over The Internet 


Remote theft of data collected by a tracing machine 
presents another threat to network tracing systems. There 
are many possible ways to break into a system over the 




network, yet there is one simple solution that eliminates 
this entire class of attacks. To collect traces, Bunker uses 
a specialized network capture card that is incapable of 
sending outgoing data. It also uses firewall rules to limit 
access to the tracing machine from the internal private 
network. Section 5.3 examines Bunker’s security
measures against such attacks in depth.


2.4 Operational Attacks 


Attacks that traverse the network link being moni- 
tored, such as denial-of-service (DoS) attacks, may also 
incidentally affect the tracing system. This is a problem 
when tracing networks with direct connections to the In- 
ternet: Internet hosts routinely receive attack traffic such 
as vulnerability probes, denial-of-service (DoS) attacks, 
and back-scatter from attacks occurring elsewhere on the 
Internet [36]. Methods exist to reduce the impact of DoS 
attacks [31] and adversarial traffic [13]. However, these 
methods may have limited effectiveness against a large 
enough attack. Both Bunker and offline anonymization 
systems are more resilient to such attacks because they 
need not process the traffic in real time. 

Because many network studies collect traces for rel- 
atively long time periods, an attacker with physical ac- 
cess could tamper with the monitoring system after it 
has started tracing, creating the appearance that the orig- 
inal system is still running. For example, the attacker 
might reboot the system and then set up a new closed- 
box environment that uses anonymization keys known to 
the attacker. Section 6 describes a simple modification to 
Bunker that addresses this type of attack. 


2.5 Attacks On Anonymization 


Packet injection attacks attempt to partially learn the 
anonymization mapping by injecting traffic and then ana- 
lyzing the anonymized trace. To perform such attacks, an 
adversary transmits traffic over the network being traced 
and later identifies this traffic in the anonymized trace. 
These attacks are possible when non-sensitive trace in- 
formation (e.g., times or request sizes) is used to cor- 
relate entries in the anonymized trace with the specific 
traffic being generated by the adversary. Packet injec- 
tion attacks do not completely break the anonymization 
mapping because they do not let the adversary deduce 
the anonymization key. Even without packet injection, 
recent work has shown that private information can still 
be recovered from data anonymized with state-of-the-art 
techniques [10, 34]. These attacks typically make use 
of public information and attempt to correlate it with the 
obfuscated data. Our tracing system is susceptible to at- 
tacks on the anonymization scheme. The best way to de- 
fend against this class of attacks is to avoid public release 
of anonymized trace data [10]. 


[Figure 1 diagram: Raw Traffic -> Data Analysis & Anonymization (Closed-box & Safe-on-reboot) -> Anonymized Data]

Figure 1. Logical view of Bunker: Raw data enters the
closed-box perimeter and only anonymized data leaves
this perimeter.


Another problem involves ensuring that the anony- 
mization policy is specified correctly, and that the 
implementation correctly implements the specification. 
Bunker does not explicitly address these issues. We rec- 
ommend code reviews of the trace analysis and anony- 
mization software. However, even a manual audit of this 
software can miss certain properties and anomalies that 
could be exploited by a determined adversary [34]. Al- 
though there is no simple checklist to follow that ensures 
a trace does not leak private data, there are tools that can 
aid in the design and implementation of sound anony- 
mization policies [35]. 


2.6 Summary 


Bunker’s design raises the bar for mounting any of 
these attacks successfully. At a high level, our threat 
model assumes that: (1) the attacker has physical access 
to the tracing infrastructure but no specialized hardware, 
such as a bus monitoring tool; (2) the attacker did not 
participate in implementing the trace analysis software. 
While Bunker’s security design is motivated by the threat 
of subpoenas, it also addresses the other four classes of 
attacks described in this section. We examine security 
attacks against Bunker in Section 5 and we discuss legal 
issues in Section 8. 


3 The Bunker Architecture


Our main insight when designing Bunker is that a 
tracing infrastructure can maintain large caches of sen- 
sitive data without compromising user privacy as long 
as none of that data leaves the host. Figure 1 illustrates 
Bunker’s high-level design, which takes raw traffic as in- 
put and generates an anonymized trace. 


3.1 Design Goals 


1. Privacy. While the system may store sensitive data 
such as unanonymized packets, it must not permit an out- 
side agent to extract anything other than analysis output. 

2. Ease of development. The system should place as 
few constraints as possible on implementing the analysis 
software. For example, protocol reconstruction and pars- 
ing should not have real-time performance requirements. 
3. Robustness. Common bugs found in handling
corner cases in parsing and analysis code should lead to 
small errors in the trace rather than crashing the system 
or completely corrupting its output. 

4. Performance. The proposed system must per- 
form as well as today’s network tracers when running on 
equivalent hardware. In particular, it should be possible 
to trace a high-capacity link with inexpensive hardware. 

5. Use commodity hardware and software. The 
proposed system should not require specialized hard- 
ware, such as a Trusted Platform Module (TPM). 


3.2 Privacy Properties


To meet our privacy design goal, we must protect all 
gathered trace data even from an attacker who has phys- 
ical access to the network tracing platform. To achieve 
this high level of protection, we designed Bunker to have
the following two properties: 

1. Closed-box. The tracing infrastructure runs all 
software that has direct access to the captured trace data 
inside a closed-box environment. Administrators, oper- 
ators, and users cannot interact with the tracing system 
or access its internal state once it starts running. Input 
to the closed-box environment is raw traffic; output is an 
anonymized trace. 

2. Safe-on-reboot. Upon a reboot, all gathered sensi- 
tive data is effectively destroyed. This means that all un- 
encrypted data is actually destroyed; the encryption key 
is destroyed for all encrypted data placed in stable stor- 
age. Bunker uses ECC RAM modules that are zeroed 
out by the BIOS before booting [21]. Thus, it is safe-on-reboot
for reboots caused by pressing the RESET button
or by powering off the machine. 

The closed-box property prevents an attacker from 
gaining access to the data or to the tracing code while 
it is running. However, this property is not sufficient. 
An attacker could restart the system and boot a different 
software image to access data stored on the tracing system,
or an attacker could tamper with the tracing hardware
(e.g., remove a hard drive and plug it into another
system). To protect sensitive data against such physical
attacks, we use the safe-on-reboot property to erase all 
sensitive data upon a reboot. Together, these two proper- 
ties prevent an attacker from gaining access to sensitive 
data via system tampering. 
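
The interplay between encrypted stable storage and a key held only in volatile memory can be illustrated with a toy Python model. Everything in this sketch is an assumption made for illustration: the class and method names are hypothetical, and the SHAKE-256 keystream is a stand-in for a real disk cipher, not a vetted encryption scheme and not the mechanism Bunker uses.

```python
import hashlib
import secrets

class EphemeralStore:
    """Toy model: ciphertext persists on 'disk', the key lives only in memory."""

    def __init__(self):
        self._key = secrets.token_bytes(32)  # held in RAM only
        self.disk = {}                       # stands in for stable storage

    def _keystream(self, nonce: bytes, length: int) -> bytes:
        # Illustrative keystream derived with SHAKE-256; not a vetted cipher.
        return hashlib.shake_256(self._key + nonce).digest(length)

    def write(self, name: str, data: bytes) -> None:
        nonce = secrets.token_bytes(16)
        stream = self._keystream(nonce, len(data))
        self.disk[name] = (nonce, bytes(a ^ b for a, b in zip(data, stream)))

    def read(self, name: str) -> bytes:
        nonce, ciphertext = self.disk[name]
        stream = self._keystream(nonce, len(ciphertext))
        return bytes(a ^ b for a, b in zip(ciphertext, stream))

    def reboot(self) -> None:
        # A reboot clears RAM: the old key is gone, the disk contents remain.
        self._key = secrets.token_bytes(32)
```

In this model, data written before `reboot()` is still present on the simulated disk afterwards, but it can no longer be decrypted because the key that protected it no longer exists. This mirrors the statement above that encrypted data is "effectively destroyed" when its key is destroyed.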


3.3 The Closed-Box Property


Bunker uses virtual machines to provide the closed- 
box property. We now describe the rationale for our de- 
sign and implementation. 


3.3.1 Design Approach 


In debating whether to use virtual or physical ma- 
chines (e.g., a sealed appliance) to design our closed-box 
Figure 2. Overview of Bunker’s implementation. The
closed-box VM runs a carefully configured Linux kernel. 
The shaded area represents the Trusted Computing Base 
(TCB) of our system. 


environment, we chose the virtual machine option pri- 
marily for flexibility and ease of development. We an- 
ticipated that our design would undergo small modifica- 
tions to accommodate unforeseen problems and worried 
that making small changes to a sealed appliance would 
be too difficult after the initial system was implemented 
and deployed. With VMs, Bunker’s software can be eas- 
ily retrofitted to trace different types of traffic. For exam- 
ple, we used Bunker to gather a trace of Hotmail e-mails 
and to gather flow-level statistics about TCP traffic. 

Virtual machine monitors (VMMs) have been used in 
the past for building closed-box VMs [20, 11]. Using 
virtual machines to provide isolation is especially ben- 
eficial for tasks that require little interaction [6], such 
as network tracing. Bunker runs all software that pro- 
cesses captured data inside a highly trusted closed-box 
VM. Users, administrators, and software in other VMs 
cannot interact with the closed-box or access any of its 
internal state once it starts running. 


3.3.2 Implementation Details 


We used the Xen 3.1 VMM to implement Bunker’s 
closed-box environment. Xen, an open-source VMM, 
provides para-virtualized x86 virtual machines [4]. The 
VMM executes at the highest privilege level on the pro- 
cessor. Above the VMM are the virtual machines, which 
Xen calls domains. Each domain executes a guest oper- 
ating system, such as Linux, which runs at a lower privi- 
lege level than the VMM. 

In Xen, Domain0 has a special role: it uses a control
interface provided by the VMM to perform management
functions outside of the VMM, such as creating
other domains and providing access to physical devices 
(including the network interfaces). Both its online trace 
iptables -P INPUT DROP 
iptables -A INPUT -m state --state ESTABLISHED -j ACCEPT 
iptables -A OUTPUT -m state --state NEW,ESTABLISHED -j ACCEPT 


Figure 3. iptables firewall rules: An abbreviated list of 
the rules that creates a one-way-initiation interface be- 
tween the closed-box VM and the open-box VM. These 
rules allow connections only if they are initiated by the 
closed-box VM. Note that the ESTABLISHED state above 
refers to a connection state used by iptables and not to 
the ESTABLISHED state in the TCP stack.


collection and offline trace analysis components are im- 
plemented as a collection of processes that execute on a 
“crippled” Linux kernel that runs in the Domain0 VM,
as shown in Figure 2. 

We carefully configured the Linux kernel running in 
Domain0 to run as a closed-box VM. To do this, we
severely limited the closed-box VM’s I/O capabilities 
and disabled all the kernel functionality (i.e., kernel sub- 
systems and modules) not needed to support tracing. We 
disabled all drivers (including the monitor, mouse and 
keyboard) inside the kernel except for: 1) the network 
capture card driver; 2) the hard disk driver; 3) the vir- 
tual interface driver, used for closed-box VM to open- 
box VM communication, and 4) the standard NIC driver 
used to enable networking in the open-box VM. We also 
disabled the login functionality; nobody, ourselves in- 
cluded, can login to the closed-box VM. Once the kernel 
boots, the kernel init process runs a script that launches 
the tracer. We provide a publicly downloadable copy of 
the kernel configuration file! used to compile the Do- 
mainO kernel so that anyone can audit it. 

The closed-box VM sends anonymized data and
non-sensitive diagnostic data to the open-box VM via a
one-way-initiation interface, as follows. We set up a
layer-3 firewall (e.g., iptables) that allows only those
connections initiated by the closed-box VM; this firewall
drops any unsolicited traffic from the open-box VM. Figure 3
presents an abbreviated list of the firewall rules used to
configure this interface.
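
The one-way-initiation property can be illustrated with a small model of a stateful filter, seen from the closed-box VM's side. This is a sketch only and is not how iptables is implemented: the `OneWayFilter` class, its method names, and the addresses are hypothetical, and connection tracking is reduced to a set of flow tuples.

```python
from typing import Set, Tuple

# (local_ip, local_port, remote_ip, remote_port), as seen by the closed-box VM
Flow = Tuple[str, int, str, int]


class OneWayFilter:
    """Toy model of a default-drop INPUT policy with connection tracking."""

    def __init__(self) -> None:
        self._tracked: Set[Flow] = set()

    def outbound(self, flow: Flow) -> bool:
        # OUTPUT accepts NEW and ESTABLISHED: remember the flow and let it out.
        self._tracked.add(flow)
        return True

    def inbound(self, flow: Flow) -> bool:
        # INPUT accepts only packets of flows that the local side opened.
        return flow in self._tracked


if __name__ == "__main__":
    fw = OneWayFilter()
    upload = ("10.0.0.1", 40000, "10.0.0.2", 22)   # opened by the closed box
    probe = ("10.0.0.1", 22, "10.0.0.2", 51515)    # attempted from outside
    print(fw.inbound(probe))    # unsolicited traffic is dropped
    fw.outbound(upload)
    print(fw.inbound(upload))   # a reply on a locally opened flow is accepted
```

In this model the open-box VM can never cause state to be created; only traffic sent by the closed-box VM does, which is the asymmetry the rules in Figure 3 aim for.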

We deliberately crippled the kernel to restrict all other 
I/O except that from the four remaining drivers. We con- 
figured and examined each driver to eliminate any possi- 
bility of an adversary taking advantage of these channels 
to attack Bunker. Section 5 describes Bunker’s system 
security in greater detail. 


3.4 The Safe-on-Reboot Property 


To implement the safe-on-reboot property, we need to
ensure that all sensitive data and the anonymization key
are stored in volatile memory only. However, tracing
experiments frequently generate more sensitive data than
can fit into memory. For example, a researcher might
need to capture a very large raw packet trace before
running a trace analysis program that makes multiple passes
through the trace. VMMs alone cannot protect data written
to disk, because an adversary could simply move the
drive to another system to extract the data.


¹ http://www.slup.cs.toronto.edu/utmtrace/config-2.6.18-xen0-noscreen


3.4.1 Design Approach 


On boot-up, the closed-box VM selects a random key 
that will be used to encrypt any data written to the hard 
disk. This key (along with the anonymization key) is 
stored only in the closed-box VM’s volatile memory,
ensuring that it is both inaccessible to other VMs and lost
on reboot. Because data stored on the disk can be read 
only with the encryption key, this approach effectively 
destroys the data after a reboot. The use of encryption to 
make disk storage effectively volatile is not novel; swap 
file encryption is used on some systems to ensure that 
fragments of an application’s memory space do not per- 
sist once the application has terminated or the system has 
restarted [39]. 


3.4.2 Implementation Details 


To implement the safe-on-reboot property, we need to
ensure that all sensitive information is stored either
only in volatile memory or on disk in encrypted form, with
the encryption key itself held only in volatile memory. To
implement the encrypted store, we use the dm-crypt [41]
device-mapper module from the Linux 2.6.18 kernel.
This module provides a simple abstraction: it adds an 
encrypted device on top of any ordinary block device. 
As a result, it works with any file system. The dm-crypt 
module supports several encryption schemes; we used 
the optimized implementation of AES. To ensure that 
data in RAM does not accidentally end up on disk, we 
disabled the swap partition. If swapping is needed in the 
future, we could enable dm-crypt on the swap partition. 
The root file system partition that contains the
closed-box operating system is initially mounted read-only.
Because most Linux configurations expect the root
partition to be writable, we enable a read-write overlay for
the root partition that is protected by dm-crypt. This also 
ensures that the trace analysis software does not acciden- 
tally write any sensitive data to disk without encryption. 
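
As an illustration of an encrypted device whose key exists only in memory, the sketch below builds (but does not run) a `cryptsetup` command line for a plain dm-crypt mapping keyed from the kernel's random number generator. It is an assumption-laden example rather than the configuration described above: it uses the syntax of current `cryptsetup` releases, and the device path, mapping name, cipher specification, and key size are placeholders chosen for illustration.

```python
import shlex
from typing import List


def ephemeral_dmcrypt_cmd(device: str, name: str,
                          cipher: str = "aes-cbc-essiv:sha256",
                          key_bits: int = 256) -> List[str]:
    """Build (but do not run) a cryptsetup call for a throwaway-key mapping."""
    return [
        "cryptsetup", "open", "--type", "plain",
        "--cipher", cipher,
        "--key-size", str(key_bits),
        # The key is read from the kernel RNG and never written anywhere,
        # so it exists only in memory while the mapping is active.
        "--key-file", "/dev/urandom",
        device, name,
    ]


if __name__ == "__main__":
    # /dev/sdb1 and bunker_trace are placeholders, not names from the paper.
    print(shlex.join(ephemeral_dmcrypt_cmd("/dev/sdb1", "bunker_trace")))
```

Because nothing records the key, a file system created on the resulting mapping becomes unreadable once the machine reboots, which is the behavior the safe-on-reboot property relies on.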


3.5 Trace Analysis Architecture 


Bunker’s tracing software consists of two major 
pieces: 1) the online component, independent of the par- 
ticular network tracing experiment, and 2) the offline 
component, which in our case is a phishing analysis trac- 
ing application. Figure 4 shows Bunker’s entire pipeline, 
including the online and offline components. 

Bunker uses tcpdump version 3.9.5 to collect packet
traces. We fine-tuned tcpdump to increase the size of its
receive buffers. All output from tcpdump is sent directly
to bfr, a Linux non-blocking pipe buffer that buffers data
between Bunker’s offline and online components. We
use multiple memory-mapped files residing on the
encrypted disks as the bfr buffer and we allocate 380 GB of
disk space to it, sufficient to buffer over 8 hours of HTTP
traffic for our network. Figure 5 shows how bfr’s buffer
size varies over time.


Figure 4. Flow of trace data through Bunker’s modules. The online part of Bunker consists of tcpdump and the bfr
buffering module. The offline part of Bunker consists of bfr, libNids, HTTP parser, Hotmail parser, SpamAssassin, and
an anonymizer module. Also, tcpdump, bfr, and libNids are generic components of Bunker, whereas HTTP parser,
Hotmail parser, SpamAssassin, and the anonymizer module are specific to our current application: collecting traces
of phishing e-mail.

Our Bunker deployment at the University of Toronto 
is able to trace continuously, even with an unoptimized 
offline component. This is because of the cyclical na- 
ture of network traffic (e.g., previous studies showed that 
university traffic is 1.5 to 2 times lower on a weekend 
day than on a week day [42, 50]). This allows the offline 
component to catch up with the online component dur- 
ing periods of low load, such as nights and weekends. In 
general, Bunker can only trace continuously if the buffer 
drains completely at least once during the week. If the
peak buffer size during a weekday is p and Bunker’s
offline component leaves an amount A unprocessed at the
end of a weekday (see Figure 5), Bunker is able to trace
continuously if the following two conditions hold:

1. Bunker’s buffer size is larger than 4 × A + p, that is,
the amount of unprocessed data after four consecutive
weekdays plus the peak traffic on the fifth weekday;

2. During the weekend, Bunker’s offline component
can catch up to the online component by at least 5 × A
of the unprocessed data in the buffer.
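
The two conditions can be checked mechanically. The helper below is a minimal sketch; the function and parameter names are ours, the 380 GB buffer and the 50 GB daily residue come from the text and from Figure 5, and the peak sizes and weekend drain capacity are made-up inputs used only to exercise the check.

```python
def can_trace_continuously(buffer_gb: float, peak_gb: float,
                           residual_gb: float, weekend_drain_gb: float) -> bool:
    """Check the two conditions for continuous tracing.

    peak_gb is p, residual_gb is A, and weekend_drain_gb is how much
    backlog the offline component can clear over one weekend.
    """
    fits_in_buffer = buffer_gb > 4 * residual_gb + peak_gb
    weekend_catches_up = weekend_drain_gb >= 5 * residual_gb
    return fits_in_buffer and weekend_catches_up


if __name__ == "__main__":
    # p = 150 GB: 380 > 4 * 50 + 150 and 300 >= 5 * 50, so both hold.
    print(can_trace_continuously(380, 150, 50, 300))
    # p = 200 GB: 380 > 4 * 50 + 200 is false, so the buffer would overflow.
    print(can_trace_continuously(380, 200, 50, 300))
```

With A = 50 GB, the first condition bounds the tolerable weekday peak at just under 180 GB for a 380 GB buffer.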

The tracing application we built using Bunker gath- 
ers traces of phishing e-mails received by Hotmail users 
at the University of Toronto. The offline trace analysis 
component performs five tasks: 1) reassembling pack- 
ets into TCP streams; 2) parsing HTTP; 3) parsing Hot- 
mail; 4) running SpamAssassin over the Hotmail e-mails, 
and 5) anonymizing output. To implement each of these 
tasks, we wrote simple Python and Perl scripts that made 
extensive use of existing libraries and tools. 

For TCP/IP reconstruction, we used libNids [48], a C 
library that runs the TCP/IP stack from the Linux 2.0 ker- 
nel in user-space. libNids supports reassembly of both 
IP fragments and TCP streams. Both the HTTP and the
Hotmail parsers are written in Python version 2.5. We 
used a wrapper for libNids in Python to interface with our 
HTTP parsing code. Whenever a TCP stream is assem- 
bled, libNids calls a Python function that passes on the 
content to the HTTP and Hotmail parsers. The Hotmail 
parser passes the bodies of the e-mail messages to Spa- 
mAssassin (written in Perl) to utilize its spam and phish- 
ing detection algorithms. The output of SpamAssassin 
is parsed and then added to an internal object that repre- 
sents the Hotmail message. This object is then serialized 
as a Python “pickled” object before it is transferred to 
the anonymization engine. We used an HTTP anonymiz- 
ation policy similar to the one described in [35]. We took 
two additional steps towards ensuring that the anonymiz- 
ation policy is correctly specified and implemented: (1) 
we performed a code review of the policy and its
implementation, and (2) we made the policy and the code
available to the University of Toronto’s network
operators, encouraging them to inspect it.
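
The hand-off from the parsers to the anonymization engine can be sketched as follows. This is an illustration under stated assumptions, not the policy from [35]: the record is a plain dictionary with hypothetical field names, and the keyed hash used to replace identifying fields is our stand-in for whatever transformation the real policy applies.

```python
import hashlib
import hmac
import os
import pickle


def pseudonym(key: bytes, value: str) -> str:
    """Keyed hash: stable within one run, unlinkable once the key is gone."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()[:16]


def anonymize(key: bytes, pickled: bytes) -> dict:
    """Unpickle a parsed message and replace its identifying fields."""
    msg = pickle.loads(pickled)      # only unpickle data produced locally
    out = dict(msg)
    for field in ("client_ip", "subject"):   # hypothetical field names
        out[field] = pseudonym(key, msg[field])
    return out


if __name__ == "__main__":
    key = os.urandom(32)             # held in memory only, so lost on reboot
    parsed = {"client_ip": "192.0.2.7",
              "subject": "Verify your account",
              "spam_score": 7.3}     # as if filled in from SpamAssassin output
    wire = pickle.dumps(parsed)      # serialized hand-off to the anonymizer
    print(anonymize(key, wire))
```

The same input maps to the same pseudonym for as long as the key lives, so flows can still be correlated within one experiment, while non-identifying fields such as the score pass through unchanged.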


3.6 Debugging 


Debugging a closed-box environment is challenging 
because an attacker could use the debugging interface to 
extract sensitive internal state from the system. Despite 
this restriction, we found the development of Bunker’s 
analysis software to be relatively easy. In our experience,
the off-the-shelf analysis code we used in Bunker
was well tested and debugged. We used two additional
techniques to help debug Bunker’s analysis
code. First, we tested our software extensively in the lab 
against synthetic traffic sources that do not pose any pri- 
vacy risks. To do this, we booted Bunker into a special 
diagnostic mode that left I/O devices (such as the key- 
board and monitor) enabled. This configuration allowed 
us to easily debug the system and patch the analysis soft- 
ware without rebooting. 

Second, we ensured that every component of our
analysis software produced diagnostic logs. These logs were
sent from the closed-box VM to the open-box VM using
the same interface as the anonymized trace. They proved
helpful in shedding light on the “health” of the processes
inside the closed-box VM. We were careful to ensure that
no sensitive data could be written to the log files in order
to preserve trace data privacy.


Figure 5. The size of bfr’s buffer over time. While the
queue size increases during the day, it decreases during
the night when there is less traffic. At the end of this
particular day, Bunker’s offline component still had 50 GB of
unprocessed raw trace left in the buffer.


4 The Benefits of Bunker 


This section presents the benefits offered by Bunker’s 
architecture. 


4.1 Privacy Benefits 


Unlike offline anonymization, our approach does not 
allow network administrators or researchers to work 
directly with sensitive data at any time. Because 
unanonymized trace data cannot be directly accessed, it 
cannot be produced under a subpoena. Our approach 
also greatly reduces the chance that unanonymized data 
will be stolen or accidentally released because individu- 
als cannot easily extract such data from the system. 

The privacy guarantees provided by our tracing sys- 
tem are more powerful than those offered by online 
anonymization. Bunker’s anonymization key is stored 
within the closed-box VM, which prevents anyone from 
accessing it. While online anonymization tracing sys- 
tems are typically careful to avoid writing unanonymized 
data to stable storage, they generally do not protect the 
anonymization key against theft by an adversary with the 
ability to log in to the machine.


4.2 Software Engineering Benefits 


When an encrypted disk is used to store the raw net- 
work trace for later processing, the trace analysis code is 
free to run offline at slower than line speeds. Bunker sup- 
ports two models for tracing. In continuous tracing, the 
disk acts as a large buffer, smoothing the traffic’s bursts 
and its daily cycles. To trace network traffic continu- 
ously, Bunker’s offline analysis code needs to run fast 
enough for the average traffic rate, but it need not keep 
up with the peak traffic rate. Bunker also supports
deferred trace analysis, where the length of the tracing
period is limited by the amount of disk storage, but there
are no constraints on the performance of the offline trace 
analysis code. In contrast, online anonymization tracing 
systems process data as it arrives and therefore must
handle peak traffic in real time.
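
The distinction between keeping up with the average rate and keeping up with the peak rate can be made concrete with a small backlog simulation. The sketch below is illustrative only: the function name and the hourly traffic profile are invented, and the offline component is modeled as draining the buffer at a constant rate.

```python
from typing import Sequence, Tuple


def simulate_backlog(arrivals_gb: Sequence[float],
                     drain_gb_per_hour: float) -> Tuple[float, float]:
    """Return (peak backlog, backlog at the end) for hourly arrival volumes."""
    backlog = 0.0
    peak = 0.0
    for arrived in arrivals_gb:
        # The disk buffer absorbs whatever the offline component cannot process.
        backlog = max(0.0, backlog + arrived - drain_gb_per_hour)
        peak = max(peak, backlog)
    return peak, backlog


if __name__ == "__main__":
    # Made-up daily profile: 12 busy hours at 30 GB/h, 12 quiet hours at 6 GB/h,
    # for an average of 18 GB/h.
    day = [30.0] * 12 + [6.0] * 12
    print(simulate_backlog(day, drain_gb_per_hour=20.0))  # above the average
    print(simulate_backlog(day, drain_gb_per_hour=16.0))  # below the average
```

At 20 GB/h, which is above the average but well below the 30 GB/h peak, the backlog peaks at 120 GB and drains to zero overnight. At 16 GB/h it peaks at 168 GB and 48 GB remain at the end of the day, so the residue would accumulate from one day to the next.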

Bunker’s flexible performance requirements let the 
developer use managed languages and sophisticated li- 
braries when creating trace analysis software. As a re- 
sult, its code is both easier to write and less likely to 
contain bugs. The phishing analysis application using 
Bunker was built by one graduate student in less than two 
months, including the time spent configuring the closed- 
box environment (a one-time cost with Bunker). This 
development effort contrasts sharply with our experience 
developing tracing systems with online anonymization. 
To improve performance, these systems required devel- 
opers to write carefully optimized code in low-level lan- 
guages using sophisticated data structures. Bunker lets 
us use Python scripts to parse HTTP, a TCP/IP reassem- 
bly library, and Perl scripts running SpamAssassin. 


4.3 Fault Handling Benefits


One serious drawback of most online trace analysis 
techniques is their inability to cope gracefully with bugs 
in the analysis software. Often, these are “corner-case”
bugs that arise in abnormal traffic patterns. In many cases 
researchers and network operators would prefer to ig- 
nore these abnormal flows and continue the data gath- 
ering process; however, if the tracing software crashes, 
all data would be lost until the system can be restarted. 
This could result in the loss of megabytes of data even 
if the restart process is entirely automated. Worse, this 
process introduces systematic bias in the data collection 
because crashes are more likely to affect long-lived than 
short-lived flows. 

Bunker can better cope with bugs because its online 
and offline components are fully decoupled. This pro- 
vides a number of benefits. First, Bunker’s online trace 
collection software is simple because it only captures 
packets and loads them in RAM (encryption is handled 
automatically at the file system layer). Its simplicity and 
size make it easy to test extensively. Second, the on- 
line software need not change even when the type of 
trace analysis being performed changes. Third, the of- 
fline trace analysis software also becomes much simpler 
because it need not be heavily optimized to run at line 
speed. Unoptimized software tends to have a simpler 
program structure and therefore fewer bugs. Simpler 
program structure also makes it easier to recover from 
bugs when they do arise. Finally, a decoupled architec- 
ture makes it possible to identify the flow that caused the 
error in the trace analyzer, filter out that flow from the
buffered raw trace, and restart the trace analyzer so that
it never sees that flow as input and thereby avoids the bug 
entirely. Section 7 quantifies the effect of this improved 
fault handling on the number of flows that are dropped 
due to a parsing bug. 
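
The filter-and-restart strategy described above can be sketched in a few lines. This is a simplified model under our own assumptions, not Bunker's code: the trace is a list of `(flow id, payload)` records, a parser bug is modeled as an exception, and the function names are hypothetical.

```python
from typing import Callable, Dict, List, Optional, Sequence, Set, Tuple

Record = Tuple[int, bytes]           # (flow id, payload): an assumed format
Results = Dict[int, List[str]]


def run_once(trace: Sequence[Record], parse: Callable[[bytes], str],
             skip: Set[int]) -> Tuple[Optional[Results], Optional[int]]:
    """One analyzer pass; returns (results, None) or (None, offending flow)."""
    results: Results = {}
    for flow_id, payload in trace:
        if flow_id in skip:
            continue                 # filtered out before the analyzer sees it
        try:
            parsed = parse(payload)
        except Exception:
            return None, flow_id     # the analyzer "crashed" on this flow
        results.setdefault(flow_id, []).append(parsed)
    return results, None


def analyze_with_restarts(trace: Sequence[Record],
                          parse: Callable[[bytes], str]) -> Tuple[Results, Set[int]]:
    """Restart the analyzer over the buffered trace until no flow crashes it."""
    skip: Set[int] = set()
    while True:
        results, culprit = run_once(trace, parse, skip)
        if culprit is None:
            return results, skip
        skip.add(culprit)            # each restart excludes one more flow
```

Each restart removes a flow that has not been excluded before, so the loop ends after at most as many passes as there are distinct flows, and only the offending flows are lost.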


5 Security Attacks 


Bunker’s design is inspired by Terra, a VM-based 
platform for trusted computing [20]. Both Terra and 
Bunker protect sensitive data by encapsulating it in a 
closed-box VM with deliberately restricted I/O inter- 
faces. The security of such architectures does not rest 
on the size of the trusted computing base (TCB) but on 
whether an attacker can exploit a vulnerability through 
the system’s narrow interfaces. Even if there is a vulner- 
ability in the OS running in the closed-box VM, Bunker 
remains secure as long as attackers cannot exploit the 
vulnerability through the restricted channels. In our
experience, ISPs have found Bunker’s security properties a
significant step forward in protecting users’ privacy when
tracing.

Attacks on Bunker fall into three classes. The first
class attempts to subvert the narrow interfaces of the
closed-box VM; a successful attack on these interfaces
exposes the closed-box VM’s internals. The second class
consists of physical attacks, in which the attacker tampers
with Bunker’s hardware. The third class exploits the fact
that Bunker deliberately allows network traffic into the
closed-box VM: an attacker could try to exploit a
vulnerability in the trace analysis software by injecting
traffic into the network being monitored. We now examine
each attack type in greater detail.


5.1 Attacking the Restricted Interfaces of the 
Closed-Box VM 


There are three ways to attack the restricted interfaces 
of the closed-box VM: 1) subverting the isolation pro- 
vided by the VMM to access the memory contents of the 
closed-box VM; 2) exploiting a security vulnerability in 
one of the system’s drivers; and 3) attacking the closed- 
box VM directly using the one-way-initiation interface 
between the closed-box and open-box VMs.


5.1.1 Attacking the VMM 


We use a VMM to enforce isolation between soft- 
ware components that need access to sensitive data and 
those that do not. Bunker’s security rests on the assump- 
tion that VMM-based isolation is hard to attack, an as- 
sumption made by many in industry [23, 47] and the re- 
search community [20, 11, 6, 43]. There are other ap- 
proaches we could have used to confine sensitive data 
strictly to the pre-loaded analysis software. For exam- 
ple, we could have used separate physical machines to 
host the closed-box and open-box systems. Alternatively, we
could have relied on a kernel and its associated isola- 
tion mechanisms, such as processes and file access con- 
trols. However, VM-based isolation is generally thought 
to provide stronger security than process-based isolation 
because VMMs are small enough to be rigorously ver- 
ified and export only a very narrow interface to their 
VMs [6, 7, 29]. In contrast, kernels are complex pieces 
of software that expose a rich interface to their processes. 


5.1.2 Attacking the Drivers 


Drivers are among the buggiest components of an 
OS [8]. Security vulnerabilities in drivers let attackers 
bypass all access restrictions imposed by the OS. Sys- 
tems without an IOMMU are especially susceptible to 
buggy drivers because they cannot prevent DMA-capable 
hardware from accessing arbitrary memory addresses. 
Many filesystem drivers can be exploited by carefully 
crafted filesystems [53]. Thus, if Bunker were to auto- 
mount inserted media, an attacker could compromise the 
system by inserting a CDROM or USB memory device 
with a carefully crafted filesystem image. 

Bunker addresses such threats by disabling all drivers 
(including the monitor, mouse, and keyboard) except 
these four: 1) the network capture card driver, 2) the 
hard disk driver, 3) the driver for the standard NIC used 
to enable networking in the open-box VM, and 4) the 
driver for the virtual interfaces used between the closed- 
box and open-box VMs. In particular, we were careful 
to disable external storage device support (i.e. CDROM, 
USB mass storage) and USB support. 

We examined each of these drivers and believe that 
none can be exploited to gain access to the closed-box. 
First, the network capture card loads incoming network 
traffic via one of the drivers left enabled in Domain0.
This capture card, a special network monitoring card 
made by Endace (DAG 4.3GE) [17], cannot be used for 
two-way communication. Thus, an attacker cannot gain 
remote access to the closed-box solely through this net- 
work interface. The second open communication chan- 
nel is the SCSI controller driver for our hard disks. This 
is a generic Linux driver, and we checked the Linux ker- 
nel mailing lists to ensure that it had no known bugs. The 
third open communication channel, the NIC used by the 
open-box VM, remains in the closed-box VM because 
Xen’s design places all hardware drivers in Domain0. We
considered mapping this driver directly into DomainU, 
but doing so would create challenging security issues re- 
lated to DMA transfers that are best addressed with spe- 
cialized hardware support (SecVisor [43] discusses these 
issues in detail). Instead, we use firewall rules to ensure 
that all outbound communication on this NIC originates 
from the open-box VM. As with the SCSI driver, this is 
a generic Linux gigabit NIC driver, and we verified that 
it had no known bugs. The final open communication
channel is constructed by installing a virtual NIC in both
the closed-box and open-box VMs and then building a
virtual network between them. Typical for most Xen
environments, this configuration permits communication
across different domains. As with the SCSI driver, we
checked that it had no known security vulnerabilities.


5.1.3 Attacking the One-Way-Initiation Interface 


Upon startup, Bunker firewalls the interface between 
the open-box VM and the closed-box VM using ipta- 
bles. The rules used to configure iptables dictate that no 
connections are allowed unless they originate from the 
closed-box VM (see Figure 3). We re-used a set of rules 
from an iptables configuration for firewalling home envi- 
ronments found on the Internet. 


5.2 Attacking Hardware 


Bunker protects the closed-box VM from hardware at- 
tacks by making it safe-on-reboot. If an attacker turns off 
the machine to tamper with the hardware (e.g. by remov- 
ing existing hardware or installing new hardware), the 
sensitive data contained in the closed-box VM is effec- 
tively destroyed. This is because the encryption keys and 
any unencrypted data are only stored in volatile memory 
(RAM). Therefore, hardware attacks must be mounted 
while the system is running. Section 5.1.2 discusses how 
we eliminated all unnecessary drivers from Bunker; this 
protects Bunker against attacks relying on adding new 
system devices, such as USB devices. 

Another class of hardware attacks is one in which the 
attacker attempts to extract sensitive data (e.g., the en- 
cryption keys) from RAM. Such attacks can be mounted 
in many ways. A recent project demonstrated that the 
contents of today’s RAM modules may remain readable 
even minutes after the system has been powered off [21]. 
Bunker is vulnerable to such attacks: an attacker could 
try to extract the encryption keys from memory by re- 
moving the RAM modules from the tracing machine and 
placing them into one configured to run key-searching 
software over memory on bootup [21]. Another approach 
is to attach a bus monitor to observe traffic on the mem- 
ory bus. Preventing RAM-based attacks requires special- 
ized hardware, which we discuss below. Yet another way 
is to attach a specialized device, such as certain Firewire 
devices, that can initiate DMA transfers without any sup- 
port from software running on the host [37, 14]. Prevent- 
ing this attack requires either 1) disabling the Firewire 
controller or 2) support from an IOMMU to limit which 
memory regions can be accessed by Firewire devices. 

Secure Co-processors Can Prevent Hardware At- 
tacks: A secure co-processor contains a CPU pack- 
aged with a moderate amount of non-volatile memory 
enclosed in a tamper-resistant casing [44]. A secure 


co-processor would let Bunker store the encryption and 
anonymization keys, the unencrypted trace data and the 
code in a secure environment. It also allows the code to 
be executed within the secure environment. 

Trusted Platform Modules (TPMs) Cannot Pre- 
vent Hardware Attacks: Unfortunately, the use of 
TPMs would not significantly help Bunker survive hard- 
ware attacks. The limited storage and execution capa- 
bilities of a TPM cannot fully protect encryption keys 
and other sensitive data from an adversary with physical 
access [21]. This is because symmetric encryption and 
decryption are not performed directly by the TPM; these 
operations are still handled by the system’s CPU. There- 
fore, the encryption keys must be exposed to the OS and 
stored in RAM, making them subject to the attack types 
mentioned above. 


5.3 Attacking the Trace Analysis Software


An attacker could inject carefully crafted network 
traffic to exploit a vulnerability in the trace analysis soft- 
ware, such as a buffer overflow. Because this software 
does not run as root, such attacks cannot disable the nar- 
row interfaces of the closed-box; the attacker needs root 
privileges to alter the OS drivers or the iptables firewall
rules. Nevertheless, such an attack could obtain access 
to sensitive data, skip the anonymization step, and send 
captured data directly to the open-box VM through the 
one-way-initiation interface. 

While possible, such attacks are challenging to mount 
in practice for two reasons. First, Bunker’s trace anal- 
ysis software combines C (e.g., tcpdump plus a TCP/IP 
reconstruction library, which is a Linux 2.0 networking 
stack running in user-space), Python, and Perl. The C 
code is well-known and well-tested, making it less likely 
to have bugs that can be remotely exploited by injecting 
network traffic. Bunker’s application-level parsing code 
is written in Python and Perl, two languages that are re- 
sistant to buffer overflows. In contrast, online anonymiz- 
ers write all their parsing code in unmanaged languages 
(e.g., C or C++) in which it is much harder to handle code 
errors and bugs. 

Second, a successful attack would send sensitive data 
to the open-box VM. The attacker must then find a way 
to extract the data from the open-box VM. To mitigate 
this possibility, we firewall the open-box’s NIC to re- 
ject any traffic unless it originates from our own private 
network. Thus, to be successful, an attacker must not 
only find an exploitable bug in the trace analysis code 
but must also compromise the open-box VM through an 
attack that originates from our private network. 


6 Operational Issues 


At boot time, Bunker’s bootloader asks the user to 
choose between two configurations: an ordinary one and 
a restricted one. The ordinary configuration loads a
typical Xen environment with all drivers enabled. We use
this environment only to prepare a tracing experiment 
and to configure Bunker; we never gather traces in it 
because it offers no privacy benefits. To initiate a trac- 
ing experiment, we boot into the restricted environment. 
When booting into this environment, Bunker’s display 
and keyboard freeze because no drivers are being loaded. 
In this configuration, we use the open NIC to log in to 
the open-box VM where we can monitor the anonymized 
traces received through the one-way-initiation interface. 
These traces also contain meta-data about the health of 
the closed-box VM, including a variety of counters (such 
as packets received, packets lost, usage of memory, and 
amount of free space on the encrypted disk). 

Network studies often need traces that span weeks, 
months, or even years. The closed-box nature of Bunker 
and its long-term use raise the possibility of the following 
operational attack: an intruder gains physical access to 
Bunker, reboots it, and sets it up with a fake restricted en- 
vironment that behaves like Bunker’s restricted environ- 
ment but uses encryption and anonymization keys known 
to the intruder. This attack could remain undetected by 
network operators. From the outside, Bunker seems to 
have gathered network traces continuously. 

To prevent this attack, Bunker could generate a pub- 
lic/private key-pair upon starting the closed-box VM. 
The public key would be shared with the network op- 
erator who saves an offline copy, while the private key 
would never be released from the closed-box VM. To 
verify that Bunker’s code has not been replaced, the 
closed-box VM would periodically send a heartbeat mes- 
sage through the one-way-initiation interface to the open- 
box. The heartbeat message would contain the experi- 
ment’s start time, the current time, and additional coun- 
ters, all signed with the private key to let network opera- 
tors verify that Bunker’s original closed-box remains the 
one currently running. This prevention mechanism is not 
currently implemented. 


7 Evaluation 


This section presents a three-pronged evaluation of 
Bunker. First, we measure the performance overhead in- 
troduced by virtualization and encryption. Second, we 
evaluate Bunker’s software engineering benefits when 
compared to online tracing tools. Third, we conduct an 
experiment to show Bunker’s fault handling benefits. 


7.1 Performance Overhead 


To evaluate the performance overhead of virtualization
and encryption, we ran tcpdump (i.e., Bunker’s online
component) to capture all traffic traversing a gigabit
link and store it to disk. We measured the highest rate of
traffic tcpdump can capture with no packet losses under
four configurations: standalone, running in a Xen VM,
running on top of an encrypted disk with dm-crypt [41],
and running on top of an encrypted disk in a Xen VM.


Figure 6. Performance overhead of virtualization and
encryption: We measured the rate of traffic that tcpdump
can capture on our machine with no packet losses under
four configurations: standalone, running in a Xen VM,
running on top of an encrypted file system, and running
on top of an encrypted file system in a Xen VM. All output
captured by tcpdump was written to the disk.

Our tracing host is a dual Intel Xeon 3.0GHz with 
4 GB of RAM, six 150 GB SCSI hard-disk drives, and 
a DAG 4.3GE capture card. We ran Linux Debian 4.0 
(etch), kernel version 2.6.18-4 and attached the tracer to 
a dedicated Dell PowerConnect 2724 gigabit switch with 
two other commodity PCs attached. One PC sent con- 
stant bit-rate (CBR) traffic at a configurable rate to the 
other; the switch was configured to mirror all traffic to 
our tracing host. We verified that no packets were being 
dropped by the switch. 

Figure 6 shows the results of this experiment. The 
first bar shows that we capture 925 Mbps when running 
tcpdump on the bare machine with no isolation. The lim- 
iting factor in this case is the rate at which our commod- 
ity PCs can exchange CBR traffic; even after fine tuning, 
they can exchange no more than 925 Mbps on our gi- 
gabit link. The second bar shows that running tcpdump 
inside the closed-box VM has no measurable effect on 
the capture rate because the limiting factor remains our 
traffic injection rate. When we use the Linux dm-crypt 
module for encryption, however, the capture rate drops 
to 817 Mbps even when running on the bare hardware: 
the CPU becomes the bottleneck when running the en- 
cryption module. Combining both virtualization and en- 
cryption shows a further drop in the capture rate, to 618 
Mbps. Once the CPU is fully utilized by the encryp- 
tion module, the additional virtualization costs become 
apparent. 
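
The relative cost of each configuration follows directly from the capture rates reported above. The short calculation below uses only those three figures; the function name is ours. Because the 925 Mbps baseline is itself capped by the traffic generators, the percentages are best read as lower bounds on the true overhead.

```python
BASELINE_MBPS = 925.0  # standalone tcpdump, limited by the traffic generators


def slowdown_percent(rate_mbps: float,
                     baseline_mbps: float = BASELINE_MBPS) -> float:
    """Capture-rate reduction relative to the baseline, in percent."""
    return 100.0 * (baseline_mbps - rate_mbps) / baseline_mbps


if __name__ == "__main__":
    for label, rate in [("Xen VM", 925.0),
                        ("dm-crypt", 817.0),
                        ("Xen VM + dm-crypt", 618.0)]:
        print(f"{label}: {slowdown_percent(rate):.1f}% below baseline")
```

This gives 0.0% for virtualization alone, 11.7% for encryption alone, and 33.2% for the combination.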

Our implementation of Bunker can trace network traf- 
fic of up to 618 Mbps with no packet loss. This is suf- 
ficiently fast for the tracing scenario that our university 
requires. While the costs of encryption and virtualiza- 
tion are not negligible, we believe that these overheads 
will decrease over time as Linux and Xen incorporate 
further optimizations to their block-level encryption and
virtualization software. At the same time, CPU
manufacturers have started to incorporate hardware
acceleration for AES encryption (similar to what dm-crypt
uses) [46].


7.2 Software Engineering Benefits 


As previously discussed, Bunker offers significant 
software engineering benefits over online network trac- 
ing systems. Figure 7 shows the number of lines of code 
for three network tracing systems that perform HTTP 
parsing, all developed by this paper’s authors. The first 
two systems trace HTTP traffic at line speeds. The first 
system was developed from scratch by two graduate stu- 
dents over the course of one year. The second system 
was developed by one graduate student in nine months; 
this system was built on top of CoMo, a packet-level trac- 
ing system developed by Intel Research [22]. Bunker is 
the third system; it was developed by one student in two 
months. As Figure 7 shows, Bunker’s codebase is an or- 
der of magnitude smaller than the others. Moreover, we 
wrote only about one fifth of Bunker’s code; the remain- 
der was re-used from libraries. 

Bunker’s smaller and simpler codebase comes at a 
cost in terms of its offline component’s performance. 
Figure 8 shows the time elapsed for Bunker’s online 
and offline components to process a 5 minute trace of 
HTTP traffic. The trace contains 4.5 million requests, 
or about 15,000 requests per second, that we generated 
using httperf. In practice, very few traces contain that 
many HTTP requests per second. While the online com- 
ponent runs only tcpdump storing data to the disk, the of- 
fline component performs TCP/IP reconstruction, parses 
HTTP, and records the HTTP headers before copying the 
trace to the open-box VM. The offline component spends 
20 minutes and 28 seconds processing this trace. Clearly, 
Bunker’s ease of development comes at the cost of per- 
formance, as we did not optimize the HTTP parser at all. 
The privacy guarantees of our isolated environment grant 
us the luxury of re-using existing software components 
even though they do not meet the performance demands 
of online tracing. 
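
To put these numbers in perspective, the short Python calculation below derives the implied processing rates. It uses only the values quoted in this section (4.5 million requests, a 5 minute trace, and 20 minutes and 28 seconds of offline processing); the variable names are ours.

```python
# Values quoted in the text above.
requests = 4_500_000
trace_seconds = 5 * 60
offline_seconds = 20 * 60 + 28

online_rate = requests / trace_seconds       # requests per second captured online
offline_rate = requests / offline_seconds    # requests per second processed offline
slowdown = offline_seconds / trace_seconds   # offline time relative to trace duration

print(f"online:  {online_rate:,.0f} requests/s")
print(f"offline: {offline_rate:,.0f} requests/s")
print(f"offline processing takes {slowdown:.1f}x the trace duration")
```

In other words, the unoptimized offline pipeline handles roughly a quarter of the request rate at which this trace was captured.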


7.3 Fault Handling Evaluation 


In addition to supporting fast development of differ- 
ent tracing experiments, Bunker handles bugs in the trac- 
ing software robustly. Upon encountering a bug, Bunker 
marks the offending flow as “erroneous” and continues 
processing traffic without having to restart. To illus- 
trate the benefits of this fault handling approach, we per- 
formed the following experiment. We used Bunker on 
a Saturday to gather a 20 hour trace of the HTTP traf- 
fic our university exchanges with the Internet. This trace 
contained over 5.2 million HTTP flows. We artificially 
injected a parsing bug in one packet out of 100,000 
(corresponding to a parsing error rate of 0.001%). Upon 
encountering this bug, Bunker stops parsing the erroneous 
HTTP flow and continues with the remaining flows. We 
compare Bunker to an online tracer that would crash 
upon encountering a bug and immediately restart. This 
would result in the online tracer dropping all concurrent 
flows (we refer to this as “collateral damage”). This 
experiment assumes an idealized version of an online tracer 
that restarts instantly; in practice, it takes tens of 
seconds to restart an online tracer’s environment, losing even 
more ongoing flows. Figure 9 illustrates the difference in 
the fraction of flows affected. While our bug is 
encountered in only 0.08% of the flows, it affects an additional 
31.72% of the flows for an online tracing system. Not 
one of these additional flows is affected by the bug when 
Bunker performs the tracing. 

Figure 7. Lines of Code in three systems for gathering 
HTTP traces: The first system was developed from 
scratch by two graduate students in one year. The second 
system, an extension of CoMo [22], was developed 
by one graduate student in nine months; we included 
CoMo’s codebase when counting the size of this system’s 
codebase. The third system, Bunker, was developed by 
one student in two months. 
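
The per-flow fault handling described in this section can be pictured with a minimal Python sketch. It is not Bunker's code: the `parse_flow` function, the `ParseError` type, and the flow representation are hypothetical stand-ins. The only point is that an exception raised while parsing one flow marks that flow as erroneous and leaves every other flow untouched.

```python
class ParseError(Exception):
    """Hypothetical error raised when a flow cannot be parsed."""


def parse_flow(payload):
    # Stand-in parser (assumption): a payload containing b"BUG" triggers
    # the parsing bug; otherwise the first line of the payload is returned.
    if b"BUG" in payload:
        raise ParseError("malformed HTTP message")
    return payload.split(b"\r\n", 1)[0].decode("ascii", "replace")


def process_trace(flows):
    """Parse every flow, isolating failures to the flow that caused them."""
    parsed, erroneous = {}, []
    for flow_id, payload in flows.items():
        try:
            parsed[flow_id] = parse_flow(payload)
        except Exception:
            # Mark only the offending flow and keep going, with no restart.
            erroneous.append(flow_id)
    return parsed, erroneous


flows = {
    1: b"GET /a HTTP/1.1\r\nHost: x\r\n\r\n",
    2: b"BUG",
    3: b"GET /b HTTP/1.1\r\nHost: y\r\n\r\n",
}
parsed, erroneous = process_trace(flows)
print(parsed)
print(erroneous)
```

An online tracer that crashes and restarts would instead lose flows 1 and 3 as well if they were in progress at the time of the crash.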


8 Legal Background 


This section presents legal background concerning the 
issuing of subpoenas for network traces in the U.S. and 
Canada and discusses legal issues inherent in designing 
and deploying data-hiding tracing platforms*. 


8.1 Issuing Subpoenas for Data Traces 


U.S. law has two sets of requirements for obtaining 
a data trace that depend on when the data was gathered. 
For data traces gathered in the past 180 days, the govern- 
ment needs a mere subpoena. Such subpoenas are ob- 
tained from a federal or state court with jurisdiction over 
the offense under investigation. Based on our conversations 
with legal experts, obtaining a subpoena is relatively 
simple in the context of a lawsuit. A defendant 
(e.g., the ISP) could try to quash the subpoena if compliance 
would be unreasonable or oppressive. 

* Any mistakes in our characterization of the U.S. or Canadian legal 
systems are the sole responsibility of the authors and not the lawyers 
we consulted during this research project. 

Figure 8. Online vs. Offline processing speed: The time 
spent processing a five minute HTTP trace by Bunker’s 
online and offline components, respectively. 

For data gathered more than 180 days earlier, a gov- 
ernment entity needs a warrant under Title 18 United 
States Code 2703(d) from a federal or state court with ap- 
propriate jurisdiction. The government needs to present 
“specific and articulable facts showing that there are rea- 
sonable grounds to believe that the contents of a wire 
or electronic communication, or the records or other in- 
formation sought, are relevant and material to an ongo- 
ing criminal investigation.” The defendant can quash the 
subpoena if the information requested is “unusually vo- 
luminous in nature” or compliance would cause undue 
burden. Based on our discussions with legal experts, the 
court would issue such a warrant if it determines that 
the data is relevant and not duplicative of information 
already held by the government entity. 

In Canada, a subpoena is sufficient to obtain a data 
trace regardless of the data’s age. In 2000, the Cana- 
dian government passed the Personal Information Pro- 
tection and Electronic Documents Act (PIPEDA) [33], 
which enhances the users’ rights to privacy for their data 
held by private companies such as ISPs. However, Sec- 
tion 7(3)(c.1) of PIPEDA indicates that ISPs must dis- 
close personal information (including data traces) if they 
are served with a subpoena or even an “order made by 
a court, person or body with jurisdiction to compel pro- 
duction of information”. In a recent case, a major Cana- 
dian ISP released personal information to the local police 
based on a letter that stated that “the request was done 
under the authority of PIPEDA” [32]. A judge subse- 
quently found that prior authorization for this informa- 
tion should have been obtained, and the ISP should not 
have disclosed this information. This case illustrates the 
complexity of the legal issues ISPs face when they store 
personal information (e.g., raw network traces). 


8.2 Developing Data-Hiding Technology 


In our discussions with legal experts, we investigated 
whether it is legal to develop and deploy a data-hiding 
network tracing infrastructure (such as Bunker). While 
there is no clear answer to this question without legal 
precedent, we learned that the way to evaluate this question 
is to consider the purpose and potential uses for the 
technology in question. In general, it is legal to deploy 
a technology that has many legitimate uses but could 
also enable certain illegitimate uses. Clearly, technologies 
whose primary use is to enable or encourage users 
to evade the law are not legal. A useful example to illustrate 
this distinction is encryption technology. While 
encryption can certainly be used to enable illegal activities, 
its many legitimate uses make development and deployment 
of encryption technologies legal. In the context 
of network tracing, protecting users’ privacy against 
accidental loss or mismanagement of the trace data is a 
legitimate purpose. 

Figure 9. Fraction of flows affected by a bug in an online 
tracer versus in Bunker: A bug crashing an online 
tracer affects all flows running concurrently with the crash. 
In contrast, Bunker handles bugs using exceptions, affecting 
only the flows that triggered the bug. 


9 Related Work 


Bunker draws on previous work in network tracing 
systems, data anonymizing techniques, and virtual ma- 
chine usage for securing systems. We summarize this 
previous work and then we describe two systems built to 
protect access to sensitive data, such as network traces. 


9.1 Network Tracing Systems 


One of the earliest network tracing systems was Http- 
dump [51], a tcpdump extension that constructs a log 
of HTTP requests and responses. Windmill [30] devel- 
oped a custom packet filter that facilitates the building 
of specific network analysis applications; it delivers cap- 
tured packets to multiple filters using dynamic code gen- 
eration. BLT [18], a network tracing system developed 
specifically to study HTTP traffic, supports continuous 
online network monitoring. BLT does not use online 
anonymization; instead, it records raw packets directly to 
disk. More recently, CoMo [22] was designed to allow 
independent parties to run multiple ongoing trace anal- 
ysis modules by isolating them from each other. With 
CoMo, anonymization, whether online or offline, must 
be implemented by each module’s owner. Unlike these 
systems, Bunker’s design was motivated by the need to 
protect the privacy of network users. 


9.2 Anonymization Techniques 


Xu et al. [52] implemented a prefix-preserving anonymization 
scheme for IP addresses, i.e., addresses with 
the same IP prefix share the same prefix after anonymization. 
Pang et al. [35] designed a high-level language for 
specifying anonymization policies, allowing researchers 
to write short policy scripts to express trace transforma- 
tions. Recent work has shown that traces can still leak 
private information even after they are anonymized [34], 
prompting the research community to propose a set 
of guidelines and etiquette for sharing data traces [1]. 
Bunker’s goal is to create a tracing system that makes 
it easy to develop trace analysis software while ensuring 
that no raw data can be exposed from the closed-box VM. 
Bunker does not protect against faulty anonymization 
policies, nor does it ensure that anonymized data cannot 
be subject to the types of attacks described in [34]. 


9.3 Using VMs for Making Systems Secure 


An active research area is designing virtual ma- 
chine architectures that are secure in the face of at- 
tacks. Several solutions have been proposed, includ- 
ing: using tamper-resistant hardware [28, 20]; design- 
ing VMMs that are small enough for formal verifica- 
tion [25, 40]; using programming language techniques 
to provide memory safety and control-flow integrity in 
commodity OSes [26, 12]; and using hardware memory 
protection to provide code integrity [43]. While these 
systems attempt to secure a general purpose commod- 
ity OS, Bunker was designed only to secure tracing soft- 
ware. As a result, its interfaces are simple and narrow. 


9.4 Protecting Access to Sensitive Data 


Packet Vault [3] is a network tracing system that cap- 
tures packets, encrypts them, and writes them to a CD. 
A newer system design tailored for writing the encrypted 
traces to tape appears in [2]. Packet Vault creates a per- 
manent record of all network traffic traversing a link. Its 
threat model differs from Bunker’s in that there is no at- 
tempt to secure the system against physical attacks. 

Armored Data Vault [24] is a system that implements 
access control to previously collected network traces, by 
using a secure co-processor to enforce security in the face 
of malicious attackers. Like Bunker, network traces are 
encrypted before being stored. The encryption key and 
any raw data are stored inside the secure co-processor. 
Bunker’s design differs from Armored Data Vault’s in 
three important ways. First, Bunker’s goal is limited to 
trace anonymization only and not to implementing access 
control policies; this lets us use simple, off-the-shelf 
anonymization code to minimize the likelihood of bugs 
present in the system. Second, Bunker destroys the raw 
data as soon as it is anonymized; the Armored Data Vault 
stores its raw traces permanently while enforcing the data 
access policy. Finally, Bunker uses commodity hardware 
that can run unmodified off-the-shelf software. In contrast, 
the authors of the Armored Data Vault had to port 
their code to accommodate the specifics of the secure co- 
processor, a process that required effort and affected the 
system’s performance [24]. 


10 Conclusions 


This paper presents Bunker, a network tracing archi- 
tecture that combines the performance and software en- 
gineering benefits of offline anonymization with the pri- 
vacy offered by online anonymization. Bunker uses a 
closed-box and safe-on-reboot architecture to protect raw 
trace data against a large class of security attacks, includ- 
ing physical attacks to the system. In addition to its secu- 
rity benefits, our architecture improves ease of develop- 
ment: using Bunker, one graduate student implemented a 
network tracing system for gathering anonymized traces 
of Hotmail e-mail in less than two months. 

Our evaluation shows that Bunker has adequate per- 
formance. We show that Bunker’s codebase is an order 
of magnitude smaller than previous network tracing sys- 
tems that perform online anonymization. Because most 
of its data processing is performed offline, Bunker also 
handles faults more gracefully than previous systems. 


Abstract 


WheelFS is a wide-area distributed storage system in- 
tended to help multi-site applications share data and gain 
fault tolerance. WheelFS takes the form of a distributed 
file system with a familiar POSIX interface. Its design al- 
lows applications to adjust the tradeoff between prompt 
visibility of updates from other sites and the ability for 
sites to operate independently despite failures and long 
delays. WheelFS allows these adjustments via semantic 
cues, which provide application control over consistency, 
failure handling, and file and replica placement. 

WheelFS is implemented as a user-level file system and 
is deployed on PlanetLab and Emulab. Three applications 
(a distributed Web cache, an email service and large file 
distribution) demonstrate that WheelFS’s file system in- 
terface simplifies construction of distributed applications 
by allowing reuse of existing software. These applica- 
tions would perform poorly with the strict semantics im- 
plied by a traditional file system interface, but by pro- 
viding cues to WheelFS they are able to achieve good 
performance. Measurements show that applications built 
on WheelFS deliver comparable performance to services 
such as CoralCDN and BitTorrent that use specialized 
wide-area storage systems. 


1 Introduction 


There is a growing set of Internet-based services that are 
too big, or too important, to run at a single site. Examples 
include Web services for e-mail, video and image hosting, 
and social networking. Splitting such services over mul- 
tiple sites can increase capacity, improve fault tolerance, 
and reduce network delays to clients. These services often 
need storage infrastructure to share data among the sites. 
This paper explores the use of a new file system specif- 
ically designed to be the storage infrastructure for wide- 
area distributed services. 

A wide-area storage system faces a tension between 
sharing and site independence. The system must support 
sharing, so that data stored by one site may be retrieved 
by others. On the other hand, sharing can be dangerous if 
it leads to the unreachability of one site causing blocking 
at other sites, since a primary goal of multi-site operation 
is fault tolerance. The storage system’s consistency 
model affects the sharing/independence tradeoff: stronger 
forms of consistency usually involve servers or quorums 
of servers that serialize all storage operations, whose un- 
reliability may force delays at other sites [23]. The storage 
system’s data and meta-data placement decisions also af- 
fect site independence, since data placed at a distant site 
may be slow to fetch or unavailable. 


The wide-area file system introduced in this paper, 
WheelFS, allows application control over the sharing/independence 
tradeoff, including consistency, failure handling, 
and replica placement. Each application can choose 
a tradeoff between performance and consistency, in the 
style of PRACTI [8] and PADS [9], but in the context of a 
file system with a POSIX interface. 


Central decisions in the design of WheelFS include 
defining the default behavior, choosing which behaviors 
applications can control, and finding a simple way 
for applications to specify those behaviors. By default, 
WheelFS provides standard file system semantics (close- 
to-open consistency) and is implemented similarly to pre- 
vious wide-area file systems (e.g., every file or directory 
has a primary storage node). Applications can adjust the 
default semantics and policies with semantic cues. The set 
of cues is small (around 10) and directly addresses the 
main challenges of wide-area networks (orders of magni- 
tude differences in latency, lower bandwidth between sites 
than within a site, and transient failures). WheelFS allows 
the cues to be expressed in the pathname, avoiding any 
change to the standard POSIX interface. The benefits of 
WheelFS providing a file system interface are compatibil- 
ity with existing software and programmer ease-of-use. 


A prototype of WheelFS runs on FreeBSD, Linux, and 
MacOS. The client exports a file system to local applica- 
tions using FUSE [21]. WheelFS runs on PlanetLab and 
an emulated wide-area Emulab network. 


Several distributed applications run on WheelFS and 
demonstrate its usefulness, including a distributed Web 
cache and a multi-site email service. The applications use 
different cues, showing that the control that cues pro- 
vide is valuable. All were easy to build by reusing ex- 
isting software components, with WheelFS for storage 
instead of a local file system. For example, the Apache 
caching web proxy can be turned into a distributed, cooperative 
Web cache by modifying one pathname in a 
configuration file, specifying that Apache should store 
cached data in WheelFS with cues to relax consistency. 
Although the other applications require more changes, the 
ease of adapting Apache illustrates the value of a file sys- 
tem interface; the extent to which we could reuse non- 
distributed software in distributed applications came as a 
surprise [38]. 

Measurements show that WheelFS offers more scal- 
able performance on PlanetLab than an implementation of 
NFSv4, and that for applications that use cues to indicate 
they can tolerate relaxed consistency, WheelFS continues 
to provide high performance in the face of network and 
server failures. For example, by using the cues .Eventu- 
alConsistency, .MaxTime, and .Hotspot, the distributed 
Web cache quickly reduces the load on the origin Web 
server, and the system hardly pauses serving pages when 
WheelFS nodes fail; experiments on PlanetLab show that 
the WheelFS-based distributed Web cache reduces origin 
Web server load to zero. Further experiments on Emulab 
show that WheelFS can offer better file download 
times than BitTorrent [14] by using network coordinates 
to download from the caches of nearby clients. 

The main contributions of this paper are a new file 
system that assists in the construction of wide-area dis- 
tributed applications, a set of cues that allows applications 
to control the file system’s consistency and availability 
tradeoffs, and a demonstration that wide-area applications 
can achieve good performance and failure behavior by using 
WheelFS. 

The rest of the paper is organized as follows. Sections 2 
and 3 outline the goals of WheelFS and its overall de- 
sign. Section 4 describes WheelFS’s cues, and Section 5 
presents WheelFS’s detailed design. Section 6 illustrates 
some example applications, Section 7 describes the imple- 
mentation of WheelFS, and Section 8 measures the per- 
formance of WheelFS and the applications. Section 9 dis- 
cusses related work, and Section 10 concludes. 


2 Goals 


A wide-area storage system must have a few key prop- 
erties in order to be practical. It must be a useful building 
block for larger applications, presenting an easy-to-use in- 
terface and shouldering a large fraction of the overall stor- 
age management burden. It must allow inter-site access to 
data when needed, as long as the health of the wide-area 
network allows. When the site storing some data is not 
reachable, the storage system must indicate a failure (or 
find another copy) with relatively low delay, so that a fail- 
ure at one site does not prevent progress at other sites. Fi- 
nally, applications may need to control the site(s) at which 
data are stored in order to achieve fault-tolerance and per- 
formance goals. 

As an example, consider a distributed Web cache whose 
primary goal is to reduce the load on the origin servers of 
popular pages. Each participating site runs a Web proxy 
and a part of a distributed storage system. When a Web 
proxy receives a request from a browser, it first checks 
to see if the storage system has a copy of the requested 
page. If it does, the proxy reads the page from the stor- 
age system (perhaps from another site) and serves it to the 
browser. If not, the proxy fetches the page from the origin 
Web server, inserts a copy of it into the storage system (so 
other proxies can find it), and sends it to the browser. 

The Web cache requires some specific properties from 
the distributed storage system in addition to the general 
ability to store and retrieve data. A proxy must serve data 
with low delay, and can consult the origin Web server if 
it cannot find a cached copy; thus it is preferable for the 
storage system to indicate “not found” quickly if finding 
the data would take a long time (due to timeouts). The 
storage need not be durable or highly fault tolerant, again 
because proxies can fall back on the origin Web server. 
The storage system need not be consistent in the sense of 
guaranteeing to find the latest stored version of a document, 
since HTTP headers allow a proxy to evaluate whether a 
cached copy is still valid. 

Other distributed applications might need different 
properties in a storage system: they might need to see the 
latest copy of some data, and be willing to pay a price in 
high delay, or they may want data to be stored durably, 
or have specific preferences for which site stores a doc- 
ument. Thus, in order to be a usable component in many 
different systems, a distributed storage system needs to 
expose a level of control to the surrounding application. 


3 WheelFS Overview 


This section gives a brief overview of WheelFS to help the 
reader follow the design proposed in subsequent sections. 


3.1 System Model 


WheelFS is intended to be used by distributed applica- 
tions that run on a collection of sites distributed over the 
wide-area Internet. All nodes in a WheelFS deployment 
are managed either by a single administrative entity or by 
multiple cooperating administrative entities. WheelFS’s 
security goals are limited to controlling the set of partici- 
pating servers and imposing UNIX-like access controls on 
clients; it does not guard against Byzantine failures in par- 
ticipating servers [6,26]. We expect servers to be live and 
reachable most of the time, with occasional failures. Many 
existing distributed infrastructures fit these assumptions, 
such as wide-area testbeds (e.g., PlanetLab and RON), 
collections of data centers spread across the globe (e.g., 
Amazon’s EC2), and federated resources such as Grids. 


3.2 System Overview 


WheelFS provides a location-independent hierarchy of directories 
and files with a POSIX file system interface. At 
any given time, every file or directory object has a single 
“primary” WheelFS storage server that is responsible for 
maintaining the latest contents of that object. WheelFS 
clients, acting on behalf of applications, use the storage 
servers to retrieve and store data. By default, clients con- 
sult the primary whenever they modify an object or need 
to find the latest version of an object. Accessing a single 
file could result in communication with several servers, 
since each subdirectory in the path could be served by a 
different primary. WheelFS replicates an object’s data us- 
ing primary/backup replication, and a background mainte- 
nance process running on each server ensures that data are 
replicated correctly. Each update to an object increments 
a version number kept in a separate meta-data structure, 
co-located with the data. 

When a WheelFS client needs to use an object, it must 
first determine which server is currently the primary for 
that object. All nodes agree on the assignment of objects 
to primaries to help implement the default strong consis- 
tency. Nodes learn the assignment from a configuration 
service—a replicated state machine running at multiple 
sites. This service maintains a table that maps each object 
to one primary and zero or more backup servers. WheelFS 
nodes cache a copy of this table. Section 5 presents the de- 
sign of the configuration service. 

A WheelFS client reads a file’s data in blocks from 
the file’s primary server. The client caches the file’s data 
once read, obtaining a lease on its meta-data (including 
the version number) from the primary. Clients have the 
option of reading from other clients’ caches, which can 
be helpful for large and popular files that are rarely up- 
dated. WheelFS provides close-to-open consistency by 
default for files, so that if an application works correctly 
on a POSIX file system, it will also work correctly on 
WheelFS. 


4 Semantic cues 


WheelFS provides semantic cues within the standard 
POSIX file system API. We believe cues would also be 
useful in the context of other wide-area storage layers with 
alternate designs, such as Shark [6] or a wide-area version 
of BigTable [13]. This section describes how applications 
specify cues and what effect they have on file system op- 
erations. 


4.1 Specifying cues 


Applications specify cues to WheelFS in pathnames; for 
example, /wfs/.Cue/data refers to /wfs/data with the cue 
.Cue. The main advantage of embedding cues in pathnames 
is that it keeps the POSIX interface unchanged. 
This choice allows developers to program using an inter- 
face with which they are familiar and to reuse software 
easily. 

One disadvantage of cues is that they may break software 
that parses pathnames and assumes that a cue is a 
directory. Another is that links to pathnames that contain 
cues may trigger unintuitive behavior. We have not en- 
countered examples of these problems. 

WheelFS clients process the cue path components lo- 
cally. A pathname might contain several cues, separated 
by slashes. WheelFS uses the following rules to combine 
cues: (1) a cue applies to all files and directories in the 
pathname appearing after the cue; and (2) cues that are 
specified later in a pathname may override cues in the 
same category appearing earlier. 
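
The two combination rules can be illustrated with a small Python sketch. This is not WheelFS's client code. The `resolve` function is hypothetical, the set of cue names is taken from Table 1, and rule (2) is simplified here so that a later cue overrides an earlier cue with the same name (the text states the rule per category).

```python
# Cue names taken from Table 1 of the text.
CUE_NAMES = {
    "Site", "KeepTogether", "RepSites", "RepLevel", "SyncLevel",
    "EventualConsistency", "MaxTime", "WholeFile", "Hotspot",
}


def resolve(pathname):
    """Return a list of (component, cues in effect for that component)."""
    components = []
    active = {}
    for part in pathname.split("/"):
        if not part:
            continue  # skip the empty string before the leading slash
        name, _, value = part[1:].partition("=")
        if part.startswith(".") and name in CUE_NAMES:
            # Rule (2), simplified: a later cue replaces an earlier cue
            # of the same name.
            active[name] = value or True
        else:
            # Rule (1): a component sees every cue that precedes it.
            components.append((part, dict(active)))
    return components


path = "/wfs/.EventualConsistency/cache/.MaxTime=200/.MaxTime=500/page"
for component, cues in resolve(path):
    print(component, cues)
```

In this example, `cache` is looked up with .EventualConsistency only, while `page` is accessed with both .EventualConsistency and the later .MaxTime=500.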

As a preview, a distributed Web cache could be 
built by running a caching Web proxy at each of a 
number of sites, sharing cached pages via WheelFS. 
The proxies could store pages in pathnames such as 
/wfs/.MaxTime=200/url, causing open() to fail after 
200 ms rather than waiting for an unreachable WheelFS 
server, indicating to the proxy that it should fetch from 
the original Web server. See Section 6 for a more sophis- 
ticated version of this application. 
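
A minimal Python sketch of the proxy's read path in this preview follows. The function name, the `fetch_from_origin` callback, and the cache layout are assumptions made for illustration; only the /wfs/.MaxTime=200/url pathname form comes from the text. On an ordinary local file system the cue component is just a literal directory name, so the sketch also runs without WheelFS.

```python
import os


def read_cached_page(cache_root, key, fetch_from_origin, max_time_ms=200):
    """Try the shared cache first; fall back to the origin on any failure."""
    path = os.path.join(cache_root, f".MaxTime={max_time_ms}", key)
    try:
        with open(path, "rb") as f:
            return f.read(), "cache"
    except OSError:
        # With WheelFS, open() failing here would cover a timeout after
        # max_time_ms as well as a missing file; in both cases the proxy
        # goes to the origin Web server instead of waiting.
        return fetch_from_origin(key), "origin"
```

A proxy would call this as `read_cached_page("/wfs", url_key, fetch)`, where `fetch` is whatever routine it already uses to contact the origin server.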


4.2 Categories 


Table 1 lists WheelFS’s cues and the categories into which 
they are grouped. There are four categories: placement, 
durability, consistency, and large reads. These categories 
reflect the goals discussed in Section 2. The placement 
cues allow an application to reduce latency by placing 
data near where it will be needed. The durability and con- 
sistency cues help applications avoid data unavailability 
and timeout delays caused by transient failures. The large 
read cues increase throughput when reading large and/or 
popular files. Table 2 shows which POSIX file system API 
calls are affected by which of these cues. 

Each cue is either persistent or transient. A persistent 
cue is permanently associated with the object, and may 
affect all uses of the object, including references that do 
not specify the cue. An application associates a persistent 
cue with an object by specifying the cue when first creat- 
ing the object. Persistent cues are immutable after object 
creation. If an application specifies a transient cue in a file 
system operation, the cue only applies to that operation. 

Because these cues correspond to the challenges faced 
by wide-area applications, we consider this set of cues to 
be relatively complete. These cues work well for the ap- 
plications we have considered. 


4.3 Placement 


Applications can reduce latency by storing data near the 
clients who are likely to use that data. For example, a 
wide-area email system may wish to store all of a user’s 
message files at a site near that user. 

The .Site=X cue indicates the desired site for a newly-created 
file’s primary. The site name can be a simple 
string, e.g. .Site=westcoast, or a domain name such as 
.Site=rice.edu. An administrator configures the correspondence 
between site names and servers. If the path 
contains no .Site cue, WheelFS uses the local node’s site 
as the file’s primary. Use of random as the site name will 
spread newly created files over all sites. If the site indicated 
by .Site is unreachable, or cannot store the file due 
to storage limitations, WheelFS stores the newly created 
file at another site, chosen at random. The WheelFS background 
maintenance process will eventually transfer the 
misplaced file to the desired site. 


Cue Category | Cue Name | Type | Meaning (and Tradeoffs) 
Placement | .Site=X | P | Store files and directories on a server at the site named X. 
Placement | .KeepTogether | P | Store all files in a directory subtree on the same set of servers. 
Placement | .RepSites=N_RS | P | Store replicas across N_RS different sites. 
Durability | .RepLevel=N_RL | P | Keep N_RL replicas for a data object. 
Durability | .SyncLevel=N_SL | T | Wait for only N_SL replicas to accept a new file or directory version, reducing both durability and delay. 
Consistency | .EventualConsistency | T* | Use potentially stale cached data, or data from a backup, if the primary does not respond quickly. 
Consistency | .MaxTime=T | T | Limit any WheelFS remote communication done on behalf of a file system operation to no more than T ms. 
Large reads | .WholeFile | T | Enable pre-fetching of an entire file upon the first read request. 
Large reads | .Hotspot | T | Fetch file data from other clients’ caches to reduce server load. Fetches multiple blocks in parallel if used with .WholeFile. 

Table 1: Semantic cues. A cue can be either Persistent (P) or Transient (T) (*Section 4.5 discusses a caveat for .EventualConsistency). 


[Table 2 body: a grid of cues against POSIX calls; its contents are not recoverable here.] 

Table 2: The POSIX file system API calls affected by each cue. 


The .KeepTogether cue indicates that an entire sub- 
tree should reside on as few WheelFS nodes as possible. 
Clustering a set of files can reduce the delay for operations 
that access multiple files. For example, an email system 
can store a user’s message files on a few nodes to reduce 
the time required to list all messages. 


The .RepSites=N_RS cue indicates how many different 
sites should have copies of the data. N_RS only has an 
effect when it is less than the replication level (see 
Section 4.4), in which case it causes one or more sites to 
store the data on more than one local server. When possible, 
WheelFS ensures that the primary’s site is one of 
the sites chosen to have an extra copy. For example, spec- 
ifying .RepSites=2 with a replication level of three causes 
the primary and one backup to be at one site, and another 
backup to be at a different site. By using .Site and .Rep- 
Sites, an application can ensure that a permanently failed 
primary can be reconstructed at the desired site with only 
local communication. 
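
The interaction between the replication level and .RepSites can be illustrated with a short sketch. The helper name copies_per_site and the rule of handing leftover copies to the earliest sites are assumptions made for illustration; the text above only states that the primary's site receives the extra copy when possible.

```python
def copies_per_site(rep_level: int, rep_sites: int) -> list:
    """Return the number of replicas held at each site, primary's site first.

    Hypothetical helper: spreads rep_level replicas over at most rep_sites
    sites and gives any leftover copies to the earliest sites, so the
    primary's site is the first to receive an extra copy.
    """
    if rep_level < 1 or rep_sites < 1:
        raise ValueError("rep_level and rep_sites must be positive")
    # RepSites only matters when it is smaller than the replication level.
    sites = min(rep_sites, rep_level)
    base, extra = divmod(rep_level, sites)
    return [base + (1 if i < extra else 0) for i in range(sites)]


# The example from the text: .RepSites=2 with a replication level of three.
print(copies_per_site(3, 2))  # [2, 1]
```

With three replicas and two sites, the primary and one backup share a site and the remaining backup sits elsewhere, matching the example in the text.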


4.4 Durability 


WheelFS allows applications to express durability
preferences with two cues: .RepLevel=N_RL and
.SyncLevel=N_SL.

The .RepLevel=N_RL cue causes the primary to store
the object on N_RL - 1 backups; by default, N_RL = 3. The
WheelFS prototype imposes a maximum of four replicas
(see Section 5.2 for the reason for this limit; in a future
prototype it will most likely be higher).

The .SyncLevel=N_SL cue causes the primary to wait
for acknowledgments of writes from only N_SL of the ob-
ject’s replicas before acknowledging the client’s request,
reducing durability but also reducing delays if some back-
ups are slow or unreachable. By default, N_SL = N_RL.


4.5 Consistency 


The .EventualConsistency cue allows a client to use an 
object despite unreachability of the object’s primary node, 
and in some cases the backups as well. For reads and 
pathname lookups, the cue allows a client to read from a 
backup if the primary is unavailable, and from the client’s 
local cache if the primary and backups are both unavail- 
able. For writes and filename creation, the cue allows a 
client to write to a backup if the primary is not available. 
A consequence of .EventualConsistency is that clients 
may not see each other’s updates if they cannot all reli- 
ably contact the primary. Many applications such as Web 
caches and email systems can tolerate eventual consis-
tency without significantly compromising their users’ ex-
perience, and in return can decrease delays and reduce ser-
vice unavailability when a primary or its network link is
unreliable.

The cue provides eventual consistency in the sense that, 
in the absence of updates, all replicas of an object will 
eventually converge to be identical. However, WheelFS 
does not provide eventual consistency in the rigorous form 
(e.g., [18]) used by systems like Bayou [39], where all 
updates, across all objects in the system, are committed 
in a total order at all replicas. In particular, updates in 
WheelFS are only eventually consistent with respect to 
the object they affect, and updates may potentially be lost. 
For example, if an entry is deleted from a directory under 
the .EventualConsistency cue, it could reappear in the 
directory later. 

When reading files or using directory contents with 
eventual consistency, a client may have a choice between 
the contents of its cache, replies from queries to one or 
more backup servers, and a reply from the primary. A 
client uses the data with the highest version number that 
it finds within a time limit. The default time limit is one 
second, but can be changed with the .MaxTime=T cue (in 
units of milliseconds). If .MaxTime is used without even- 
tual consistency, the WheelFS client yields an error if it 
cannot contact the primary after the indicated time. 
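
The selection rule for eventually consistent reads can be sketched as follows. The Reply type and the function name pick_reply are hypothetical; only the "highest version number within the time limit" rule and the one-second default come from the text.

```python
from typing import List, NamedTuple, Optional


class Reply(NamedTuple):
    source: str        # "cache", "backup", or "primary"
    version: int       # object version number carried by the reply
    arrival_ms: float  # arrival time relative to the start of the request


def pick_reply(replies: List[Reply], max_time_ms: float = 1000.0) -> Optional[Reply]:
    """Return the highest-version reply that arrived within the time limit."""
    in_time = [r for r in replies if r.arrival_ms <= max_time_ms]
    # default=None covers the case where nothing arrived in time.
    return max(in_time, key=lambda r: r.version, default=None)


replies = [
    Reply("cache", 3, 0.0),
    Reply("backup", 5, 200.0),
    Reply("primary", 6, 1500.0),  # newest data, but too late for the default limit
]
print(pick_reply(replies).source)          # backup
print(pick_reply(replies, 2000.0).source)  # primary
```

Raising the limit (as .MaxTime would) lets the slower but newer reply from the primary win.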

The background maintenance process periodically rec- 
onciles a primary and its backups so that they eventually 
contain the same data for each file and directory. The pro- 
cess may need to resolve conflicting versions of objects. 
For a file, the process chooses arbitrarily among the repli- 
cas that have the highest version number; this may cause 
writes to be lost. For an eventually-consistent directory, it 
puts the union of files present in the directory’s replicas 
into the reconciled version. If a single filename maps to 
multiple IDs, the process chooses the one with the small- 
est ID and renames the other files. Enabling directory 
merging is the only sense in which the .EventualConsis- 
tency cue is persistent: if specified at directory creation 
time, it guides the conflict resolution process. Otherwise 
its effect is specific to particular references. 
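
The directory reconciliation rule can be sketched as below. The function name and the ".conflict-<id>" rename scheme are assumptions; the text says only that the entry with the smallest ID keeps the name and the others are renamed.

```python
from typing import Dict, List


def merge_directory(replicas: List[Dict[str, int]]) -> Dict[str, int]:
    """Merge directory replicas (name -> object ID) into one reconciled view."""
    ids_by_name: Dict[str, set] = {}
    for entries in replicas:
        for name, obj_id in entries.items():
            ids_by_name.setdefault(name, set()).add(obj_id)

    merged: Dict[str, int] = {}
    for name, ids in ids_by_name.items():
        winner, *losers = sorted(ids)
        merged[name] = winner  # the smallest ID keeps the original name
        for obj_id in losers:
            # Hypothetical rename scheme; the real naming is not specified.
            merged[f"{name}.conflict-{obj_id}"] = obj_id
    return merged


print(merge_directory([{"a": 1, "b": 7}, {"b": 4, "c": 9}]))
```

Because the result is a union, an entry deleted at one replica but still present at another reappears after the merge, which is the lost-deletion behavior described in Section 4.5.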


4.6 Large reads 


WheelFS provides two cues that enable large-file read op-
timizations: .WholeFile and .Hotspot. The .WholeFile
cue instructs WheelFS to pre-fetch the entire file into 
the client cache. The .Hotspot cue instructs the WheelFS 
client to read the file from other clients’ caches, consult- 
ing the file’s primary for a list of clients that likely have 
the data cached. If the application specifies both cues, the 
client will read data in parallel from other clients’ caches. 

Unlike the cues described earlier, .WholeFile and
.Hotspot are not strictly necessary: a file system could
potentially learn to adopt the right cue by observing appli-
cation access patterns. We leave such adaptive behavior to
future work.


Figure 1: Placement and interaction of WheelFS components.


5 WheelFS Design 


WheelFS requires a design flexible enough to follow the
various cues applications can supply. This section presents 
that design, answering the following questions: 


• How does WheelFS assign storage responsibility
for data objects among participating servers? (Sec-
tion 5.2)

• How does WheelFS ensure an application’s desired
level of durability for its data? (Section 5.3)

• How does WheelFS provide close-to-open consis-
tency in the face of concurrent file access and fail-
ures, and how does it relax consistency to improve
availability? (Section 5.4)

• How does WheelFS permit peer-to-peer communica-
tion to take advantage of nearby cached data? (Sec-
tion 5.5)

• How does WheelFS authenticate users and perform
access control? (Section 5.6)


5.1 Components 


A WheelFS deployment (see Figure 1) consists of clients 
and servers; a single host often plays both roles. The 
WheelFS client software uses FUSE [21] to present the
distributed file system to local applications, typically in
/wfs. All clients in a given deployment present the same
file system tree in /wfs. A WheelFS client communicates
with WheelFS servers in order to look up file names, cre- 
ate files, get directory listings, and read and write files. 
Each client keeps a local cache of file and directory con- 
tents. 

The configuration service runs independently on a
small set of wide-area nodes. Clients and servers com-
municate with the service to learn the set of servers and
which files and directories are assigned to which servers,
as explained in the next section.


5.2 Data storage assignment 


WheelFS servers store file and directory objects. Each ob-
ject is internally named using a unique numeric ID. A
file object contains opaque file data and a directory object
contains a list of name-to-object-ID mappings for the di-
rectory contents. WheelFS partitions the object ID space
into 2^S slices using the first S bits of the object ID.

The configuration service maintains a slice table that 
lists, for each slice currently in use, a replication policy 
governing the slice’s data placement, and a replica list of 
servers currently responsible for storing the objects in that 
slice. A replication policy for a slice indicates from which 
site it must choose the slice’s primary (.Site), and from 
how many distinct sites (.RepSites) it must choose how 
many backups (.RepLevel). The replica list contains the 
current primary for a slice, and N_RL - 1 backups.

Because each unique replication policy requires a 
unique slice identifier, the choice of S limits the maxi- 
mum allowable number of replicas in a policy. In our cur- 
rent implementation S is fairly small (12 bits), and so to 
conserve slice identifiers it limits the maximum number 
of replicas to four. 


5.2.1 Configuration service 


The configuration service is a replicated state machine, 
and uses Paxos [25] to elect a new master whenever its 
membership changes. Only the master can update the 
slice table; it forwards updates to the other members. A 
WheelFS node is initially configured to know of at least
one configuration service member, and contacts it to learn 
the full list of members and which is the master. 

The configuration service exports a lock interface to 
WheelFS servers, inspired by Chubby [11]. Through this 
interface, servers can acquire, renew, and release 
locks on particular slices, or fetch a copy of the cur- 
rent slice table. A slice’s lock grants the exclusive right 
to be a primary for that slice, and the right to specify the 
slice’s backups and (for a new slice) its replication pol- 
icy. A lock automatically expires after L seconds unless 
renewed. The configuration service makes no decisions 
about slice policy or replicas. Section 5.3 explains how 
WheelFS servers use the configuration service to recover
after the failure of a slice’s primary or backups. 
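
The lock interface can be illustrated with a single-process sketch. The class and method names are hypothetical, the table is not replicated as the real configuration service is, and time is passed in explicitly so the example stays deterministic.

```python
from typing import Dict, Optional, Tuple


class SliceLocks:
    """In-memory sketch of per-slice locks that expire after lock_seconds."""

    def __init__(self, lock_seconds: float):
        self.lock_seconds = lock_seconds
        self._held: Dict[int, Tuple[str, float]] = {}  # slice -> (owner, expiry)

    def _owner(self, slice_id: int, now: float) -> Optional[str]:
        entry = self._held.get(slice_id)
        if entry is not None and entry[1] > now:
            return entry[0]
        return None  # never locked, or the lock has expired

    def acquire(self, slice_id: int, server: str, now: float) -> bool:
        if self._owner(slice_id, now) not in (None, server):
            return False  # another server still holds an unexpired lock
        self._held[slice_id] = (server, now + self.lock_seconds)
        return True

    def renew(self, slice_id: int, server: str, now: float) -> bool:
        if self._owner(slice_id, now) != server:
            return False  # cannot renew a lock that is lost or expired
        self._held[slice_id] = (server, now + self.lock_seconds)
        return True

    def release(self, slice_id: int, server: str, now: float) -> None:
        if self._owner(slice_id, now) == server:
            del self._held[slice_id]


locks = SliceLocks(lock_seconds=120.0)
print(locks.acquire(5, "A", now=0.0))    # True: A becomes primary for slice 5
print(locks.acquire(5, "B", now=10.0))   # False: A's lock is still valid
print(locks.acquire(5, "B", now=120.0))  # True: A did not renew, lock expired
```

Granting each slice's lock to at most one unexpired holder is what lets the lock double as the right to act as the slice's primary.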

Clients and servers periodically fetch and cache the 
slice table from the configuration service master. A client 
uses the slice table to identify which servers should be 
contacted for an object in a given slice. If a client encoun- 
ters an object ID for which its cached slice table does not 
list a corresponding slice, the client fetches a new table. 
A server uses the slice table to find other servers that
store the same slice so that it can synchronize with them.

Servers try to always have at least one slice locked,
to guarantee they appear in the table of currently locked 
slices; if the maintenance process notices that the server 
holds no locks, it will acquire the lock for a new slice. This 
allows any connected node to determine the current mem- 
bership of the system by taking the union of the replica 
lists of all slices. 


5.2.2 Placing a new file or directory 


When a client creates a new file or directory, it uses the 
placement and durability cues specified by the application 
to construct an appropriate replication policy. If .KeepTo- 
gether is present, it sets the primary site of the policy to 
be the primary site of the object’s parent directory’s slice. 
Next the client checks the slice table to see if an existing 
slice matches the policy; if so, the client contacts the pri- 
mary replica for that slice. If not, it forwards the request 
to a random server at the site specified by the .Site cue.

When a server receives a request asking it to create a 
new file or directory, it constructs a replication policy as 
above, and sets its own site to be the primary site for the 
policy. If it does not yet have a lock on a slice matching 
the policy, it generates a new, random slice
identifier and constructs a replica list for that slice, choos- 
ing from the servers listed in the slice table. The server 
then acquires a lock on this new slice from the config- 
uration service, sending along the replication policy and 
the replica list. Once it has a lock on an appropriate slice, 
it generates an object ID for the new object, setting the 
first S bits to be the slice ID and all other bits to random
values. The server returns the new ID to the client, and 
the client then instructs the object’s parent directory’s pri- 
mary to add a new entry for the object. Other clients that 
learn about this new object ID from its entry in the par-
ent directory can use the first S bits of the ID to find the
primary for the slice and access the object. 
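
The relationship between object IDs and slices can be sketched with a few bit operations. The 12-bit slice width and 32-bit object IDs follow the prototype configuration reported in the paper; the function names are hypothetical.

```python
import secrets

S_BITS = 12   # slice-ID width S used by the prototype
ID_BITS = 32  # object-ID width used in the evaluation setup


def new_object_id(slice_id: int) -> int:
    """Build an object ID whose first S bits are the slice ID, the rest random."""
    if not 0 <= slice_id < (1 << S_BITS):
        raise ValueError("slice_id does not fit in S bits")
    low_bits = ID_BITS - S_BITS
    return (slice_id << low_bits) | secrets.randbits(low_bits)


def slice_of(object_id: int) -> int:
    """Recover the slice ID from the first S bits of an object ID."""
    return object_id >> (ID_BITS - S_BITS)


oid = new_object_id(0xABC)
print(hex(slice_of(oid)))  # 0xabc
```

Any client that sees the ID in a directory entry can compute slice_of locally and then look up that slice's primary in its cached slice table.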


5.2.3 Write-local policy


The default data placement policy in WheelFS is to write 
locally, i.e., use a local server as the primary of a newly 
created file (and thus also store one copy of the contents 
locally). This policy works best if each client also runs a 
WheelFS server. The policy allows writes of large non-
replicated files at the speed of the local disk, and allows
such files to be written at one site and read at another with 
just one trip across the wide-area network. 

Modifying an existing file is not always fast, because 
the file’s primary might be far away. Applications desiring 
fast writes should store output in unique new files, so that 
the local server will be able to create a new object ID in 
a slice for which it is the primary. Existing software often 
works this way; for example, the Apache caching proxy 
stores a cached Web page in a unique file named after the 
page’s URL. 




An ideal default placement policy would make deci- 
sions based on server loads across the entire system; for 
example, if the local server is nearing its storage capac-
ity but a neighbor server at the same site is underloaded,
WheelFS might prefer writing the file to the neighbor 
rather than the local disk (e.g., as in Porcupine [31]). De- 
veloping such a strategy is future work; for now, applica- 
tions can use cues to control where data are stored. 


5.3 Primary/backup replication


WheelFS uses primary/backup replication to manage 
replicated objects. The slice assignment designates, for 
each ID slice, a primary and a number of backup servers. 
When a client needs to read or modify an object, by de- 
fault it communicates with the primary. For a file, a mod- 
ification is logically an entire new version of the file con- 
tents; for a directory, a modification affects just one en- 
try. The primary forwards each update to the backups, 
after which it writes the update to its disk and waits for 
the write to complete. The primary then waits for replies 
from N_SL - 1 backups, indicating that those backups have
also written the update to their disks. Finally, the primary 
replies to the client. For each object, the primary executes 
operations one at a time. 
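
The update path can be sketched as a synchronous simulation. The function name, the modeling of backups as callables, and the choice to report failure when too few backups acknowledge are all assumptions; the real system forwards updates asynchronously and the text does not say what happens when acknowledgments fall short.

```python
from typing import Callable, List


def apply_update(update: str,
                 primary_disk: List[str],
                 backups: List[Callable[[str], bool]],
                 sync_level: int) -> bool:
    """Apply one update at the primary and report whether it can be acknowledged.

    Each backup is modeled as a callable that returns True once it has
    written the update to its disk (hypothetical interface).
    """
    # 1. Forward the update to every backup and collect acknowledgments.
    acks = sum(1 for send in backups if send(update))
    # 2. Write the update to the primary's own disk.
    primary_disk.append(update)
    # 3. The primary counts as one replica, so N_SL - 1 backup acks suffice.
    return acks >= sync_level - 1


disk: List[str] = []
ok = lambda update: True      # a backup that is reachable
down = lambda update: False   # a backup that is slow or unreachable
print(apply_update("v1", disk, [ok, down], sync_level=2))    # True
print(apply_update("v2", disk, [down, down], sync_level=2))  # False
```

Lowering sync_level to 1 would let the second update succeed with only the primary's copy on disk, which is the durability-for-latency trade described in Section 4.4.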

After being granted the lock on a slice initially, the 
WheelFS server must renew it periodically; if the lock ex- 
pires, another server may acquire it to become the primary 
for the slice. Since the configuration service only grants 
the lock on a slice to one server at a time, WheelFS en- 
sures that only one server will act as a primary for a slice 
at any given time. The slice lock time L is a compromise: 
short lock times lead to fast reconfiguration, while long 
lock times allow servers to operate despite the temporary 
unreachability of the configuration service. 

In order to detect failure of a primary or backup, a 
server pings all other replicas of its slices every five min- 
utes. If a primary decides that one of its backups is un- 
reachable, it chooses a new replica from the same site 
as the old replica if possible, otherwise from a random 
site. The primary will transfer the slice’s data to this new 
replica (blocking new updates), and then renew its lock on 
that slice along with a request to add the new replica to the 
replica list in place of the old one. 

If a backup decides the primary is unreachable, it will 
attempt to acquire the lock on the slice from the configura- 
tion service; one of the backups will get the lock once the 
original primary’s lock expires. The new primary checks 
with the backups to make sure that it didn’t miss any ob-
ject updates (e.g., because N_SL < N_RL during a recent up-
date, and thus not all backups are guaranteed to have com-
mitted that update).

A primary’s maintenance process periodically checks
that the replicas associated with each slice match the
slice’s policy; if not, it will attempt to recruit new repli-
cas at the appropriate sites. If the current primary wishes
to recruit a new primary at the slice’s correct primary site 
(e.g., a server that had originally been the slice’s primary 
but crashed and rejoined), it will release its lock on the 
slice, and directly contact the chosen server, instructing it 
to acquire the lock for the slice. 


5.4 Consistency 


By default, WheelFS provides close-to-open consistency: 
if one application instance writes a file and waits for 
close() to return, and then a second application in-
stance open()s and reads the file, the second applica-
tion will see the effects of the first application’s writes.
The reason WheelFS provides close-to-open consistency 
by default is that many applications expect it. 

The WheelFS client has a write-through cache for file 
blocks, for positive and negative directory entries (en- 
abling faster pathname lookups), and for directory and file 
meta-data. A client must acquire an object lease from an 
object’s primary before it uses cached meta-data. Before 
the primary executes any update to an object, it must in- 
validate all leases or wait for them to expire. This step 
may be time-consuming if many clients hold leases on an 
object. 

Clients buffer file writes locally to improve perfor-
mance. When an application calls close(), the client
sends all outstanding writes to the primary, and waits
for the primary to acknowledge them before allowing
close() to return. Servers maintain a version num-
ber for each file object, which they increment after each
close() and after each change to the object’s meta-data.

When an application open()s a file and then reads it,
the WheelFS client must decide whether the cached copy
of the file (if any) is still valid. The client uses cached
file data if the object version number of the cached data
is the same as the object’s current version number. If the
client has an unexpired object lease for the object’s meta-
data, it can use its cached meta-data for the object to find
the current version number. Otherwise it must contact the
primary to ask for a new lease, and for current meta-data.
If the version number of the cached data is not current, the
client fetches new file data from the primary.
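
The cache-validation decision on open() can be sketched as follows. The CachedFile and Primary types and their method names are hypothetical stand-ins; the 60-second lease mirrors the one-minute object leases used in the evaluation. Updating the cache after a fetch is omitted for brevity.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class CachedFile:
    data: bytes
    data_version: int    # version of the cached file data
    meta_version: int    # current version according to cached meta-data
    lease_expiry: float  # time at which the object lease runs out


class Primary:
    """Hypothetical stand-in for the file's primary server."""

    def __init__(self, version: int, data: bytes, lease_seconds: float = 60.0):
        self.version, self.data, self.lease_seconds = version, data, lease_seconds

    def lease_and_version(self, now: float) -> Tuple[int, float]:
        return self.version, now + self.lease_seconds

    def fetch(self) -> bytes:
        return self.data


def read_on_open(cached: Optional[CachedFile], primary: Primary, now: float) -> bytes:
    if cached is not None and cached.lease_expiry > now:
        current = cached.meta_version  # lease valid: trust cached meta-data
    else:
        current, expiry = primary.lease_and_version(now)  # ask for a new lease
        if cached is not None:
            cached.meta_version, cached.lease_expiry = current, expiry
    if cached is not None and cached.data_version == current:
        return cached.data  # cached data is current
    return primary.fetch()  # stale or missing: fetch from the primary


primary = Primary(version=2, data=b"new")
stale = CachedFile(b"old", data_version=1, meta_version=1, lease_expiry=10.0)
print(read_on_open(stale, primary, now=50.0))  # b'new'
```

The lease is what makes the cached version number trustworthy: the primary must invalidate or wait out leases before it applies an update.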

By default, WheelFS provides similar consistency for 
directory operations: after the return of an application sys- 
tem call that modifies a directory (links or unlinks a file 
or subdirectory), applications on other clients are guaran- 
teed to see the modification. WheelFS clients implement 
this consistency by sending directory updates to the direc- 
tory object’s primary, and by ensuring via lease or explicit 
check with the primary that cached directory contents are 
up to date. Cross-directory rename operations in WheelFS 
are not atomic with respect to failures. If a crash occurs at 
the wrong moment, the result may be a link to the moved 
file in both the source and destination directories. 

The downside to close-to-open consistency is that if a
primary is not reachable, all operations that consult the 
primary will delay until it revives or a new primary takes 
over. The .EventualConsistency cue allows WheelFS to 
avoid these delays by using potentially stale data from 
backups or local caches when the primary does not re- 
spond, and by sending updates to backups. This can result 
in inconsistent replicas, which the maintenance process 
resolves in the manner described in Section 4.5, leading 
eventually to identical images at all replicas. Without the 
.EventualConsistency cue, a server will reject operations 
on objects for which it is not the primary. 

Applications can specify timeouts on a per-object ba- 
sis using the .MaxTime=T cue. This adds a timeout of 
T ms to every operation performed at a server. Without
.EventualConsistency, a client will return a failure to
the application if the primary does not respond within T
ms; with .EventualConsistency, clients contact backup
servers once the timeout occurs. In future work we hope to
explore how to best divide this timeout when a single file
system operation might involve contacting several servers
(e.g., a create requires talking to the parent directory’s pri-
mary and the new object’s primary, which could differ).


5.5 Large reads 


If the application specifies .WholeFile when reading a 
file, the client will pre-fetch the entire file into its cache. 
If the application uses .WholeFile when reading directory 
contents, WheelFS will pre-fetch the meta-data for all of 
the directory’s entries, so that subsequent lookups can be 
serviced from the cache. 

To implement the .Hotspot cue, a file’s primary main- 
tains a soft-state list of clients that have recently cached 
blocks of the file, including which blocks they have 
cached. A client that reads a file with .Hotspot asks the 
server for entries from the list that are near the client; the 
server chooses the entries using Vivaldi coordinates [15]. 
The client uses the list to fetch each block from a nearby 
cached copy, and informs the primary of successfully 
fetched blocks. 

If the application reads a file with both .WholeFile and 
.Hotspot, the client will issue block fetches in parallel to
multiple other clients. It pre-fetches blocks in a random
order so that clients can use each other’s caches even if
they start reading at the same time [6]. 
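
A rough sketch of the randomized fetch plan is shown below. The function name, the peer_blocks map, and the fallback to the primary for blocks no peer holds are assumptions; the text specifies only the random block order and fetching from other clients' caches.

```python
import random
from typing import Dict, List, Set, Tuple


def plan_fetches(num_blocks: int,
                 peer_blocks: Dict[str, Set[int]],
                 rng: random.Random) -> List[Tuple[int, str]]:
    """Return (block, source) pairs in a random block order.

    peer_blocks maps a nearby client to the blocks it has cached, as the
    primary's soft-state list would report (hypothetical representation).
    """
    order = list(range(num_blocks))
    rng.shuffle(order)  # random order, so simultaneous readers diverge
    plan = []
    for block in order:
        holders = sorted(p for p, blocks in peer_blocks.items() if block in blocks)
        # Assumed fallback: ask the primary when no peer caches the block.
        source = rng.choice(holders) if holders else "primary"
        plan.append((block, source))
    return plan


peers = {"clientA": {0, 1}, "clientB": {1, 2}}
for block, source in plan_fetches(4, peers, random.Random()):
    print(block, source)
```

Because each client shuffles independently, clients that start at the same time soon hold different blocks and can serve one another instead of all waiting on the same source.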


5.6 Security 


WheelFS enforces three main security properties. First, 
a given WheelFS deployment ensures that only autho- 
rized hosts participate as servers. Second, WheelFS en- 
sures that requests come only from users authorized to 
use the deployment. Third, WheelFS enforces user-based 
permissions on requests from clients. WheelFS assumes 
that authorized servers behave correctly. A misbehaving 
client can act as any user who has authenticated
to WheelFS from that client, but can only do things
for which those users have permission. 

All communication takes place through authenticated 
SSH channels. Each authorized server has a public/pri- 
vate key pair which it uses to prove its identity. A central 
administrator maintains a list of all legitimate server pub- 
lic keys in a deployment, and distributes that list to ev- 
ery server and client. Servers only exchange inter-server 
traffic with hosts authenticated with a key on the list, and 
clients only send requests to (and use responses from) au- 
thentic servers. 

Each authorized user has a public/private key pair; 
WheelFS uses SSH’s existing key management support. 
Before a user can use WheelFS on a particular client, 
the user must reveal his or her private key to the client. 
The list of authorized user public keys is distributed to all
servers and clients as a file in WheelFS. A server accepts 
only client connections signed by an authorized user key. 
A server checks that the authenticated user for a request 
has appropriate permissions for the file or directory being 
manipulated—each object has an associated access con- 
trol list in its meta-data. A client dedicated to a particular 
distributed application stores its “user” private key on its 
local disk. 

Clients check data received from other clients against 
server-supplied SHA-256 checksums to prevent clients 
from tricking each other into accepting unauthorized 
modifications. A client will not supply data from its cache 
to another client whose authorized user does not have read 
permissions. 
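
The checksum check on client-to-client transfers can be sketched with the standard hashlib module. The function name and the hex-string encoding of the server-supplied checksum are assumptions.

```python
import hashlib
import hmac


def verify_block(data: bytes, expected_hex: str) -> bool:
    """Check a block received from another client against the server's SHA-256."""
    actual_hex = hashlib.sha256(data).hexdigest()
    # compare_digest avoids leaking the mismatch position through timing.
    return hmac.compare_digest(actual_hex, expected_hex)


block = b"file block contents"
checksum_from_server = hashlib.sha256(block).hexdigest()
print(verify_block(block, checksum_from_server))        # True
print(verify_block(b"tampered", checksum_from_server))  # False
```

A block that fails the check would simply be discarded and fetched again from another source.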

There are several planned improvements to this security 
setup. One is an automated mechanism for propagating 
changes to the set of server public keys, which currently 
need to be distributed manually. Another is support for
SSH agent forwarding, which would let users connect se-
curely without storing private keys on client hosts and
would better protect highly privileged keys if a client
is compromised.


6 Applications 


WheelFS is designed to help the construction of wide-area
distributed applications, by shouldering a significant part 
of the burden of managing fault tolerance, consistency, 
and sharing of data among sites. This section evaluates 
how well WheelFS fulfills that goal by describing four 
applications that have been built using it. 


All-Pairs-Pings. All-Pairs-Pings [37] monitors the net-
work delays among a set of hosts. Figure 2 shows a sim-
ple version of All-Pairs-Pings built from a shell script and
WheelFS, to be invoked by each host’s cron every few
minutes. The script pings the other hosts and puts the re-
sults in a file whose name contains the local host name
and the current time. After each set of pings, a coordina-
tor host (“node1”) reads all the files, creates a summary
using the program process (not shown), and writes the
output to a results directory.


# Output file name: <timestamp>.<hostname>.dat
FILE=`date +%s`.`hostname`.dat
D=/wfs/ping
# Binaries: relaxed consistency, 5000 ms timeout, large-read cues
BIN=$D/bin/.EventualConsistency/.MaxTime=5000/.HotSpot/.WholeFile
DATA=$D/.EventualConsistency/dat
mkdir -p $DATA/`hostname`
cd $DATA/`hostname`
# Ping every host listed in $D/nodes, then copy the result into WheelFS
xargs -n1 $BIN/ping -c 10 < $D/nodes > /tmp/$FILE
cp /tmp/$FILE $FILE
rm /tmp/$FILE
# On the coordinator, run the summary program and store its output
if [ `hostname` = "node1" ]; then
  mkdir -p $D/res
  $BIN/process * > $D/res/`date +%s`.o
fi


Figure 2: A shell script implementation of All-Pairs-Pings us-
ing WheelFS.

This example shows that WheelFS can help keep sim- 
ple distributed tasks easy to write, while protecting the 
tasks from failures of remote nodes. WheelFS stores each 
host’s output on the host’s own WheelFS server, so that 
hosts can record ping output even when the network is 
broken. WheelFS automatically collects data files from 
hosts that reappear after a period of separation. Finally, 
WheelFS provides each host with the required binaries 
and scripts and the latest host list file. Use of WheelFS in 
this script eliminates much of the complexity of a previ- 
ous All-Pairs-Pings program, which explicitly dealt with 
moving files among nodes and coping with timeouts. 


Distributed Web cache. This application consists 
of hosts running Apache 2.2.4 caching proxies 
(mod_disk_cache). The Apache configuration file
places the cache file directory on WheelFS: 


/wfs/.EventualConsistency/.MaxTime=1000/.Hotspot/cache/


When the Apache proxy can’t find a page in the cache 
directory on WheelFS, it fetches the page from the ori- 
gin Web server and writes a copy in the WheelFS di- 
rectory, as well as serving it to the requesting browser. 
Other cache nodes will then be able to read the page from 
WheelFS, reducing the load on the origin Web server. 
The .Hotspot cue copes with popular files, directing the 
WheelFS clients to fetch from each other’s caches to in-
crease total throughput. The .EventualConsistency cue
allows clients to create and read files even if they cannot 
contact the primary server. The .MaxTime cue instructs
WheelFS to return an error if it cannot find a file quickly,
causing Apache to fetch the page from the origin Web 
server. If WheelFS returns an expired version of the file, 
Apache will notice by checking the HTTP header in the 
cache file, and it will contact the origin Web server for a 
fresh copy. 
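
To make the structure of such a path concrete, the sketch below separates cue components from ordinary path components. The function name and the fixed KNOWN_CUES set (taken from the cue names used in this paper) are assumptions, and the sketch does not model which later components a cue applies to.

```python
from typing import Dict, Tuple, Union

# Cue names as used in this paper; treating the set as fixed is an assumption.
KNOWN_CUES = {
    "Site", "KeepTogether", "RepSites", "RepLevel", "SyncLevel",
    "EventualConsistency", "MaxTime", "WholeFile", "Hotspot",
}


def split_cues(path: str) -> Tuple[Dict[str, Union[str, bool]], str]:
    """Split a path into its cues and the path with cue components removed."""
    cues: Dict[str, Union[str, bool]] = {}
    plain = []
    for part in path.split("/"):
        name, _, value = part[1:].partition("=") if part.startswith(".") else ("", "", "")
        if name in KNOWN_CUES:
            cues[name] = value if value else True  # valueless cues act as flags
        else:
            plain.append(part)  # ordinary component, including dotfiles
    return cues, "/".join(plain)


cues, plain = split_cues("/wfs/.EventualConsistency/.MaxTime=1000/.Hotspot/cache/")
print(cues)
print(plain)  # /wfs/cache/
```

Only components that match a known cue name are stripped, so an ordinary hidden file such as .htaccess would be left in the path.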

Although this distributed Web cache implementation is 
fully functional, it does lack features present in other sim- 
ilar systems. For example, CoralCDN uses a hierarchy of 
caches to avoid overloading any single tracker node when 
a file is popular. 


Mail service. The goal of Wheemail, our WheelFS-based 
mail service, is to provide high throughput by spreading 
the work over many sites, and high availability by replicat- 
ing messages on multiple sites. Wheemail provides SMTP 
and IMAP service from a set of nodes at these sites. Any 
node at any site can accept a message via SMTP for any 
user; in most circumstances a user can fetch mail from the 
IMAP server on any node. 

Each node runs an unmodified sendmail process to ac- 
cept incoming mail. Sendmail stores each user’s messages 
in a WheelFS directory, one message per file. The sep- 
arate files help avoid conflicts from concurrent message 
arrivals. A user’s directory has this path: 


/wfs/mail/.EventualConsistency/.Site=X/.KeepTogether/.RepSites=2/user/Mail/


Each node runs a Dovecot IMAP server [17] to serve users 
their messages. A user retrieves mail via a nearby node 
using a locality-preserving DNS service [20]. 

The .EventualConsistency cue allows a user to read 
mail via backup servers when the primary for the user’s 
directory is unreachable, and allows incoming mail to be 
stored even if the primary and all backups are down. The
.Site=X cue indicates that a user’s messages should be 
stored at site X, chosen to be close to the user’s usual lo- 
cation to reduce network delays. The .KeepTogether cue 
causes all of a user’s messages to be stored on a single 
replica set, reducing latency for listing the user’s mes- 
sages [31]. Wheemail uses the default replication level of 
three but uses .RepSites=2 to keep at least one off-site 
replica of each mail. To avoid unnecessary replication, 
Dovecot uses .RepLevel=1 for much of its internal data. 

Wheemail has goals similar to those of Porcupine [31], 
namely, to provide scalable email storage and retrieval 
with high availability. Unlike Porcupine, Wheemail runs 
on a set of wide-area data centers. Replicating emails over 
multiple sites increases the service’s availability when a 
single site goes down. Porcupine consists of custom-built 
storage and retrieval components. In contrast, the use of a 
wide-area file system in Wheemail allows it to reuse exist- 
ing software like sendmail and Dovecot. Both Porcupine 
and Wheemail use eventual consistency to increase avail-
ability, but Porcupine has a better reconciliation policy as
its “deletion record” prevents deleted emails from reap-
pearing.

File Distribution. A set of many WheelFS clients can co- 
operate to fetch a file efficiently using the large read cues: 


/wfs/.WholeFile/.Hotspot/largefile 


Efficient file distribution may be particularly useful 
for binaries in wide-area experiments, in the spirit of
Shark [6] and CoBlitz [29]. Like Shark, WheelFS uses co- 
operative caching to reduce load on the file server. Shark 
further reduces the load on the file server by using a dis- 
tributed index to keep track of cached copies, whereas 
WheelFS relies on the primary server to track copies. 
Unlike WheelFS or Shark, CoBlitz is a CDN, so files 
cannot be directly accessed through a mounted file sys- 
tem. CoBlitz caches and shares data between CDN nodes 
rather than between clients. 


7 Implementation 


The WheelFS prototype consists of 19,000 lines of C++ 
code, using pthreads and STL. In addition, the implemen- 
tation uses a new RPC library (3,800 lines) that imple- 
ments Vivaldi network coordinates [15]. 

The WheelFS client uses FUSE’s “low level” interface 
to get access to FUSE identifiers, which it translates into 
WheelFS-wide unique object IDs. The WheelFS cache 
layer in the client buffers writes in memory and caches 
file blocks in memory and on disk. 

Permissions, access control, and secure SSH con- 
nections are implemented. Distribution of public keys 
through WheelFS is not yet implemented.


8 Evaluation


This section demonstrates the following points about the 
performance and behavior of WheelFS: 


• For some storage workloads common in distributed
applications, WheelFS offers more scalable perfor-
mance than an implementation of NFSv4.

• WheelFS achieves reasonable performance under a
range of real applications running on a large, wide-
area testbed, as well as on a controlled testbed using
an emulated network.

• WheelFS provides high performance despite net-
work and server failures for applications that indicate
via cues that they can tolerate relaxed consistency.

• WheelFS offers data placement options that allow
applications to place data near the users of that data,
without the need for special application logic.

• WheelFS offers client-to-client read options that help
counteract wide-area bandwidth constraints.

• WheelFS offers an interface on which it is quick and
easy to build real distributed applications.


8.1 Experimental setup 


All scenarios use WheelFS configured with 64 KB blocks, 
a 100 MB in-memory client LRU block cache supple- 
mented by an unlimited on-disk cache, one minute object 
leases, a lock time of L = 2 minutes, 12-bit slice IDs, 32-bit
object IDs, and a default replication level of three (the
responsible server plus two replicas), unless stated oth- 
erwise. Communication takes place over plain TCP, not 
SSH, connections. Each WheelFS node runs both a stor- 
age server and a client process. The configuration service 
runs on five nodes distributed across three wide-area sites. 
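
A few quantities follow directly from these settings. The sketch below derives them; it assumes binary prefixes (1 KB = 1024 bytes), which the text does not state, and the variable names are ours.

```python
KB = 1024          # assumption: binary prefixes
MB = 1024 * KB

block_size = 64 * KB
memory_cache = 100 * MB
slice_id_bits = 12
replication_level = 3

cache_blocks = memory_cache // block_size   # blocks that fit in the in-memory LRU cache
slice_ids = 2 ** slice_id_bits              # number of distinct slice IDs
extra_replicas = replication_level - 1      # copies besides the responsible server

print(cache_blocks, slice_ids, extra_replicas)  # 1600 4096 2
```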

We evaluate our WheelFS prototype on two testbeds: 
PlanetLab [7] and Emulab [42]. For PlanetLab experi- 
ments, we use up to 250 nodes geographically spread 
across the world at more than 140 sites (we determine the 
site of a node based on the domain portion of its host- 
name). These nodes are shared with other researchers and 
their disks, CPU, and bandwidth are often heavily loaded, 
showing how WheelFS performs in the wild. These nodes 
run a Linux 2.6 kernel and FUSE 2.7.3. We run the config- 
uration service on a private set of nodes running at MIT, 
NYU, and Stanford, to ensure that the replicated state ma- 
chine can log operations to disk and respond to requests 
quickly (fsync()s on PlanetLab nodes can sometimes
take tens of seconds). 

For more control over the network topology and host 
load, we also run experiments on the Emulab [42] testbed. 
Each Emulab host runs a standard Fedora Core 6 Linux 
2.6.22 kernel and FUSE version 2.6.5, and has a 3 GHz 
CPU. We use a WAN topology consisting of 5 LAN clus- 
ters of 3 nodes each. Each LAN cluster has 100 Mbps, 
sub-millisecond links between each node. Clusters con- 
nect to the wide-area network via a single bottleneck link 
of 6 Mbps, with 100 ms RTTs between clusters. 


8.2 Scalability 


We first evaluate the scalability of WheelFS on a mi- 
crobenchmark representing a workload common to dis- 
tributed applications: many nodes reading data written by 
other nodes in the system. For example, nodes running a 
distributed Web cache over a shared storage layer would 
be reading and serving pages written by other nodes. 
In this microbenchmark, N clients mount a shared file 
system containing N directories, either using NFSv4 or
WheelFS. Each directory contains ten 1 MB files. The 
clients are PlanetLab nodes picked at random from the 
set of nodes that support mounting both FUSE and
NFS file systems. This set spans a variety of nodes distributed
across the world, from nodes at well-connected
educational institutions to nodes behind limited-upload 
DSL lines. Each client reads ten random files from the file
system in sequence, and measures the read latency. The
clients all do this at the same time.


Figure 3: The median time for a set of PlanetLab clients to read
a 1 MB file, as a function of the number of concurrently reading
nodes. Also plots the median time for a set of local processes to
read 1 MB files from the NFS server’s local disk through ext3.

For WheelFS, each client also acts as a server, and is 
the primary for one directory and all files within that di- 
rectory. WheelFS clients do not read files for which they 
are the primary, and no file is ever read twice by the same 
node. The NFS server is a machine at MIT running De- 
bian’s nfs-kernel-server version 1.0.10-6 using the default 
configuration, with a 2.8 GHz CPU and a SCSI hard drive. 

Figure 3 shows the median time to read a file as N 
varies. For WheelFS, a very small fraction of reads fail because
not all pairs of PlanetLab nodes can communicate;
these reads are not included in the graph. Each point on 
the graph is the median of the results of at least one hun- 
dred nodes (e.g., a point showing the latency for five con- 
current nodes represents the median reported by all nodes 
across twenty different trials). 
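
The aggregation rule above can be sketched as follows. This is an illustration only, not the authors' measurement scripts; the function names are hypothetical, and we assume each node contributes one sample per trial.

```python
import math
import statistics

def trials_needed(concurrent_nodes, min_samples=100):
    """Trials required so that a graph point covers at least `min_samples` node results."""
    return math.ceil(min_samples / concurrent_nodes)

def graph_point(latencies_by_trial):
    """Median over all per-node latencies pooled across trials."""
    samples = [x for trial in latencies_by_trial for x in trial]
    return statistics.median(samples)

print(trials_needed(5))                    # 20, matching the example in the text
print(graph_point([[1.0, 3.0], [2.0]]))    # 2.0
```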

Though the NFS server achieves lower latencies when 
there are few concurrent clients, its latency rises sharply as 
the number of clients grows. This rise occurs when there 
are enough clients, and thus files, that the files do not fit 
in the server’s 1 GB file cache. Figure 3 also shows results
for N concurrent processes on the NFS server, accessing 
the ext3 file system directly, showing a similar latency 
increase after 100 clients. WheelFS latencies are not af- 
fected by the number of concurrent clients, since WheelFS 
spreads files and thus the load across many servers. 
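
The point at which the NFS working set outgrows the server's cache can be estimated from the workload parameters given earlier (N directories of ten 1 MB files each). The sketch below assumes 1 GB = 1024 MB and ignores any other use of the cache, so it is a rough estimate rather than a model of the server.

```python
FILES_PER_DIR = 10
FILE_MB = 1
CACHE_MB = 1024  # assumption: 1 GB file cache in binary units

def working_set_mb(n_clients):
    """Total data stored when there is one directory per client."""
    return n_clients * FILES_PER_DIR * FILE_MB

def first_overflow():
    """Smallest N whose working set no longer fits in the cache."""
    n = 1
    while working_set_mb(n) <= CACHE_MB:
        n += 1
    return n

print(first_overflow())  # 103
```

This is consistent with the latency increase observed after roughly 100 clients.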


8.3 Distributed Web Cache


Performance under normal conditions. These exper- 
iments compare the performance of CoralCDN and the 
WheelFS distributed Web cache (as described in Sec- 
tion 6, except with .MaxTime=2000 to adapt to Planet- 
Lab’s characteristics). The main goal of the cache is to 
reduce load on target Web servers via caching, and secondarily
to provide client browsers with reduced latency and
increased availability.


Figure 4: The aggregate client service rate and origin server
load for both CoralCDN and the WheelFS-based Web cache,
running on PlanetLab.


Figure 5: The CDF for the client request latencies of both
CoralCDN and the WheelFS-based Web cache, running on PlanetLab.


These experiments use forty nodes from PlanetLab 
hosted at .edu domains, spread across the continental 
United States. A Web server, located at NYU behind an 
emulated slow link (shaped using Click [24] to be 400 
Kbps and have a 100 ms delay), serves 100 unique 41KB 
Web pages. Each of the 40 nodes runs a Web proxy. 
For each proxy node there is another node less than 10 
ms away that runs a simulated browser as a Web client. 
Each Web client requests a sequence of randomly selected 
pages from the NYU Web server. This experiment, in- 
spired by one in the CoralCDN paper [19], models a flash 
crowd where a set of files on an under-provisioned server 
become popular very quickly. 
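
To see how under-provisioned the origin server is, one can bound the page rate its link can sustain. The calculation below is a back-of-the-envelope sketch that assumes decimal units and ignores the 100 ms delay and all protocol overhead.

```python
LINK_BPS = 400_000        # 400 Kbps emulated link
PAGE_BYTES = 41 * 1000    # 41 KB page, decimal units assumed

def max_pages_per_second(link_bps, page_bytes):
    """Upper bound on pages/second the link can carry, ignoring overhead."""
    return link_bps / (page_bytes * 8)

rate = max_pages_per_second(LINK_BPS, PAGE_BYTES)
print(f"{rate:.2f} pages/s")  # 1.22 pages/s
```

At little more than one page per second, the origin server cannot serve forty proxies' clients directly, which is why the caches must absorb nearly all requests.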


Figures 4 and 5 show the results of these experiments. 
Figure 4 plots both the total rate at which the proxies send 
requests to the origin server and the total rate at which 
the proxies serve Web client requests (the y-axis is a log 
scale). WheelFS takes about twice as much time as CoralCDN
to reduce the origin load to zero; both reach similar
sustained aggregate Web client service rates. Figure 5
plots the cumulative distribution function (CDF) of the
request latencies seen by the Web clients. WheelFS has
somewhat higher latencies than CoralCDN.


Figure 6: The WheelFS-based Web cache running on Emulab
with failures, using the .EventualConsistency cue. Gray regions
indicate the duration of a failure.

CoralCDN has higher performance because it incor- 
porates many application-specific optimizations, whereas 
the WheelFS-based cache is built from more general- 
purpose components. For instance, a CoralCDN proxy 
pre-declares its intent to download a page, preventing 
other nodes from downloading the same page; Apache, 
running on WheelFS, has no such mechanism, so several 
nodes may download the same page before Apache caches 
the data in WheelFS. Similar optimizations could be im- 
plemented in Apache. 


Performance under failures. Wide-area network prob- 
lems that prevent WheelFS from contacting storage nodes 
should not translate into long delays; if a proxy cannot 
quickly fetch a cached page from WheelFS, it should 
ask the origin Web server. As discussed in Section 6, the 
cues .EventualConsistency and .MaxTime=1000 yield 
this behavior, causing open() to either find a copy of 
the desired file or fail in one second. Apache fetches from 
the origin Web server if the open() fails.
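
The fallback described above amounts to treating a failed open() as a cache miss. The sketch below shows that pattern in Python rather than in Apache; `fetch_page` and the `fetch_from_origin` callable are hypothetical names, and the path in the usage example is only an assumed layout.

```python
def fetch_page(cache_path, fetch_from_origin):
    """Try the cached copy first; on any open/read failure, fall back to the origin.

    `cache_path` is expected to carry the cues (for example
    .EventualConsistency and .MaxTime=1000) so that open() fails quickly.
    `fetch_from_origin` is a caller-supplied function returning the page bytes.
    """
    try:
        with open(cache_path, "rb") as f:
            return f.read(), "cache"
    except OSError:
        return fetch_from_origin(), "origin"

# Usage with an assumed path layout; the mount does not exist here, so the
# call falls through to the origin.
body, source = fetch_page(
    "/wfs-not-mounted/.EventualConsistency/.MaxTime=1000/cache/page1",
    lambda: b"<html>from origin</html>",
)
print(source)  # origin
```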

To test how failures affect WheelFS application perfor- 
mance, we ran a distributed Web cache experiment on the 
Emulab topology in Section 8.1, where we could control 
the network’s failure behavior. At each of the five sites 
there are three WheelFS Web proxies. Each site also has a 
Web client, which connects to the Web proxies at the same 
site using a 10 Mbps, 20 ms link, issuing five requests at a 
time. The origin Web server runs behind a 400 Kbps link, 
with 150 ms RTTs to the Web proxies. 

Figures 6 and 7 compare failure performance of
WheelFS with the above cues to failure performance of
close-to-open consistency with 1-second timeouts
(.MaxTime=1000). The y-axes of these graphs are log-scale.
Each minute one wide-area link connecting an entire site
to the rest of the network fails for thirty seconds and then
revives. This failure period is not long enough to cause
servers at the failed site to lose their slice locks. Web
clients maintain connectivity to the proxies at their local
site during failures. For comparison, Figure 8 shows
WheelFS’s performance on this topology when there are
no failures.


Figure 7: The WheelFS-based Web cache running on Emulab
with failures, with close-to-open consistency. Gray regions
indicate the duration of a failure.


Figure 8: The aggregate client service rate and origin server
load for the WheelFS-based Web cache, running on Emulab,
without failures.


Figure 9: The throughput of Wheemail compared with the static
system, on the Emulab testbed.


When a Web client requests a page from a proxy, the
proxy must find two pieces of information in order to find
a copy of the page (if any) in WheelFS: the object ID to
which the page’s file name resolves, and the file content
for that object ID. The directory information and the file
content can be on different WheelFS servers. For each
kind of information, if the proxy’s WheelFS client has
cached the information and has a valid lease, the WheelFS
client need not contact a server. If the WheelFS client
doesn’t have information with a valid lease, and is using
eventual consistency, it tries to fetch the information
from the primary; if that fails (after a one-second timeout),
the WheelFS client will try to fetch from a backup; if
that fails, the client will use locally cached information (if
any) despite an expired lease; otherwise the open() fails
and the proxy fetches the page from the origin server. If a
WheelFS client using close-to-open consistency does not
have cached data with a valid lease, it first tries to contact
the primary; if that fails (after timeout), the proxy must
fetch the page from the origin Web server.
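
The lookup order in the preceding paragraph can be written as a small decision procedure. This is a sketch of the described behavior, not WheelFS code: the function name, the string labels, and the `fetch_primary` and `fetch_backup` callables (which either return data or raise TimeoutError) are all hypothetical.

```python
def resolve(consistency, has_valid_lease, cached, fetch_primary, fetch_backup):
    """Return (value, source), or raise OSError to model a failed open().

    consistency: "eventual" or "close-to-open" (labels chosen for this sketch).
    cached: locally cached value, or None if nothing is cached.
    """
    if cached is not None and has_valid_lease:
        return cached, "cache"           # valid lease: no server contact needed
    try:
        return fetch_primary(), "primary"
    except TimeoutError:
        pass
    if consistency == "eventual":
        try:
            return fetch_backup(), "backup"
        except TimeoutError:
            pass
        if cached is not None:
            return cached, "stale-cache"  # use cached data despite expired lease
    raise OSError("open() fails; the proxy fetches from the origin server")

def unreachable():
    raise TimeoutError

print(resolve("eventual", False, b"old", unreachable, unreachable))
# (b'old', 'stale-cache')
```

Under close-to-open consistency the same call raises OSError as soon as the primary times out, which is the behavior that sends requests to the origin server during failures.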

Figure 6 shows the performance of the WheelFS Web 
cache with eventual consistency. The graph shows a pe- 
riod of time after the initial cache population. The gray re- 
gions indicate when a failure is present. Throughput falls 
as WheelFS clients encounter timeouts to servers at the 
failed site, though the service rate remains near 100 re- 
quests/sec. The small load spikes at the origin server af- 
ter a failure reflect requests queued up in the network by 
the failed site while it is partitioned. Figure 7 shows that 
with close-to-open consistency, throughput falls signifi- 
cantly during failures, and hits to the origin server increase 
greatly. This shows that a cooperative Web cache, which 
does not require strong consistency, can use WheelFS’s 
semantic cues to perform well under wide-area condi- 
tions. 


8.4 Mail


The Wheemail system described in Section 6 has a num- 
ber of valuable properties such as the ability to serve and 
accept a user’s mail from any of multiple sites. This sec- 
tion explores the performance cost of those properties by 
comparing to a traditional mail system that lacks those 
properties. 

IMAP and SMTP are stressful file system benchmarks. 
For example, an IMAP server reading a Maildir-formatted 
inbox and finding no new messages generates over 600
FUSE operations. These primarily consist of lookups on
directory and file names, but also include more than 30
directory operations (creates/links/unlinks/renames), more
than 30 small writes, and a few small reads. A single
SMTP mail delivery generates over 60 FUSE operations,
again consisting mostly of lookups.


Figure 10: The average latencies of individual SMTP requests,
for both Wheemail and the static system, on Emulab.

In this experiment we use the Emulab network topol- 
ogy described in Section 8.1 with 5 sites. Each site has 
a 1 Mbps link to a wide-area network that connects all 
the sites. Each site has three server nodes that each run a 
WheelFS server, a WheelFS client, an SMTP server, and 
an IMAP server. Each site also has three client nodes, 
each of which runs multiple load-generation threads. A 
load-generation thread produces a sequence of SMTP and 
IMAP requests as fast as it can. 90% of requests are 
SMTP and 10% are IMAP. User mailbox directories are 
randomly and evenly distributed across sites. The load- 
generation threads pick users and message sizes with 
probabilities from distributions derived from SMTP and 
IMAP logs of servers at NYU; there are 47699 users, and 
the average message size is 6.9 KB. We measure through- 
put in requests/second, with an increasing number of con- 
current client threads. 

When measuring WheelFS, a load-generating thread at 
a given site only generates requests from users whose mail 
is stored at that site (the user’s “home” site), and connects 
only to IMAP and SMTP servers at the local site. Thus 
an IMAP request can be handled entirely within a home 
site, and does not generate any wide-area traffic (during 
this experiment, each node has cached directory lookup 
information for the mailboxes of all users at its site). A 
load-generating thread generates mail to random users, 
connecting to a SMTP server at the same site; that server 
writes the messages to the user’s directory in WheelFS, 
which is likely to reside at a different site. In this experi- 
ment, user mailbox directories are not replicated. 

We compare against a “static” mail system in which 
users are partitioned over the 15 server nodes, with the
SMTP and IMAP servers on each server node storing mail
on a local disk file system. The load-generator threads at
each site only generate IMAP requests for users at the
same site, so IMAP traffic never crosses the wide-area network.
When sending mail, a load-generating client picks
a random recipient, looks up that user’s home server, and
makes an SMTP connection to that server, often across the
wide-area network.


Figure 11: CDF of client download times of a 50 MB file using
BitTorrent and WheelFS with the .Hotspot and .WholeFile
cues, running on Emulab. Also shown is the time for a single
client to download 50 MB directly using ttcp.

Figure 9 shows the aggregate number of requests served 
by the entire system per second. The static system can 
sustain 112 requests per second. Each site’s 1 Mbps wide-area
link is the bottleneck: since 90% of the requests are
SMTP (messages with an average size of 6.85 KB), and 80%
of those go over the wide area, the system as a whole is 
sending 4.3 Mbps across a total link capacity of 5 Mbps, 
with the remaining wide-area bandwidth being used by 
the SMTP and TCP protocols. 
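
The 4.3 Mbps figure can be reproduced from the numbers in this paragraph. The sketch below assumes binary prefixes (1 Mbit = 1024 Kbit), which is our assumption and is the reading under which the arithmetic matches the reported value.

```python
def wide_area_mbps(requests_per_sec, smtp_fraction, remote_fraction, avg_msg_kb):
    """Message payload crossing the wide area, in Mbit/s (binary prefixes assumed)."""
    kb_per_sec = requests_per_sec * smtp_fraction * remote_fraction * avg_msg_kb
    return kb_per_sec * 8 / 1024  # KB/s -> Kbit/s -> Mbit/s

payload = wide_area_mbps(112, 0.90, 0.80, 6.85)
print(round(payload, 1))  # 4.3
```

The 80% remote fraction is what one would expect when recipients are spread evenly over five sites, since four of every five recipients then live at another site.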

Wheemail achieves up to 50 requests per second, 45% 
of the static system’s performance. Again the 1 Mbps 
WAN links are the bottleneck: for each SMTP request, 
WheelFS must send 11 wide-area RPCs to the target 
user’s mailbox site, adding an overhead of about 40% to 
the size of the mail message, in addition to the continuous
background traffic generated by the maintenance process,
slice lock renewal, Vivaldi coordinate measurement, and
occasional lease invalidations. 

Figure 10 shows the average latencies of individual 
SMTP requests for Wheemail and the static system, as the 
number of clients varies. Wheemail’s latencies are higher 
than those of the static system by nearly 60%, attributable 
to traffic overhead generated by WheelFS.

Though the static system outperforms Wheemail for 
this benchmark, Wheemail provides many desirable prop- 
erties that the static system lacks. Wheemail transparently 
redirects a receiver’s mail to its home site, regardless of 
where the SMTP connection occurred; additional storage
can be added to the system without major manual reconfiguration;
and Wheemail can be configured to offer tolerance
to site failures, all without any special logic having
to be built into the mail system itself.


Application       | LoC | Reuses
CDN               | 1   | Apache+mod_disk_cache
Mail service      | 4   | Sendmail+Procmail+Dovecot
File distribution | N/A | Built-in to WheelFS
All-Pairs-Pings   | 13  | N/A


Table 3: Number of lines of changes to adapt applications to
use WheelFS.


8.5 File distribution


Our file distribution experiments use a WheelFS network 
consisting of 15 nodes, spread over five LAN clusters con- 
nected by the emulated wide-area network described in 
Section 8.1. Nodes attempt to read a 50 MB file simultaneously
(initially located at an originating, 16th WheelFS
node that is in its own cluster) using the .Hotspot and 
.WholeFile cues. For comparison, we also fetch the file 
using BitTorrent [14] (the Fedora Core distribution of ver- 
sion 4.4.0-5). We configured BitTorrent to allow unlimited 
uploads and to use 64 KB blocks like WheelFS (in this 
test, BitTorrent performs strictly worse with its usual de- 
fault of 256 KB blocks). 

Figure 11 shows the CDF of the download times, under 
WheelFS and BitTorrent, as well as the time for a single 
direct transfer of 50 MB between two wide-area nodes (73 
seconds). WheelFS’s median download time is 168 sec- 
onds, showing that WheelFS’s implementation of cooper- 
ative reading is better than BitTorrent’s: BitTorrent clients 
have a median download time of 249 seconds. The im- 
provement is due to WheelFS clients fetching from nearby 
nodes according to Vivaldi coordinates; BitTorrent does 
not use a locality mechanism. Of course, both solutions 
offer far better download times than 15 simultaneous di- 
rect transfers from a single node, which in this setup has 
a median download time of 892 seconds. 
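
For context, the measured times can be compared with the ideal time to push 50 MB through the 6 Mbps bottleneck link of Section 8.1. The sketch below assumes decimal units and no protocol overhead, so it gives a lower bound rather than a prediction.

```python
def transfer_seconds(size_mb, link_mbps):
    """Ideal transfer time; decimal units assumed, so 1 MB = 8 Mbit."""
    return size_mb * 8 / link_mbps

ideal_single = transfer_seconds(50, 6)   # one direct transfer over the bottleneck
speedup_vs_bittorrent = 249 / 168        # ratio of the reported median times

print(f"{ideal_single:.1f} s")           # 66.7 s
print(f"{speedup_vs_bittorrent:.2f}x")   # 1.48x
```

The measured 73 seconds for a single direct transfer is close to this bound, and the WheelFS median is roughly two thirds of the BitTorrent median.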


8.6 Implementation ease 


Table 3 shows the number of new or modified lines of 
code (LoC) we had to write for each application (excluding
WheelFS itself). Table 3 demonstrates that developers
can benefit from a POSIX file system interface and cues 
to build wide-area applications with ease. 


9 Related Work 


There is a humbling amount of past work on distributed 
file systems, wide-area storage in general and the tradeoffs 
of availability and consistency. PRACTI [8] is a recently- 
proposed framework for building storage systems with ar- 
bitrary consistency guarantees (as in TACT [43]). Like 
PRACTI, WheelFS maintains flexibility by separating
policies from mechanisms, but it has a different goal.
While PRACTI and its recent extension PADS [9] are 
designed to simplify the development of new storage or 
file systems, WheelFS itself is a flexible file system de- 
signed to simplify the construction of distributed appli- 
cations. As a result, WheelFS’s cues are motivated by the 
specific needs of applications (such as the .Site cue) while 
PRACTI’s primitives aim at covering the entire spectrum 
of design tradeoffs (e.g., strong consistency for operations 
spanning multiple data objects, which WheelFS does not 
support). 

Most distributed file systems are designed to support 
a workload generated by desktop users (e.g., NFS [33], 
AFS [34], Farsite [2], xFS [5], Frangipani [12], Ivy [27]). 
They usually provide a consistent view of data, while 
sometimes allowing for disconnected operation (e.g., 
Coda [35] and BlueFS [28]). Cluster file systems such as 
GFS [22] and Ceph [41] have demonstrated that a dis- 
tributed file system can dramatically simplify the con- 
struction of distributed applications within a large cluster 
with good performance. Extending the success of clus- 
ter file systems to the wide-area environment continues 
to be difficult due to the tradeoffs necessary to combat 
wide-area network challenges. Similarly, Sinfonia [3] of- 
fers highly-scalable cluster storage for infrastructure ap- 
plications, and allows some degree of inter-object con- 
sistency via lightweight transactions. However, it targets 
storage at the level of individual pieces of data, rather 
than files and directories like WheelFS, and uses proto- 
cols like two-phase commit that are costly in the wide 
area. Shark [6] shares with WheelFS the goal of allowing 
client-to-client data sharing, though its use of a central- 
ized server limits its scalability for applications in which 
nodes often operate on independent data. 

Successful wide-area storage systems generally exploit 
application-specific knowledge to make decisions about 
tradeoffs in the wide-area environment. As a result, many 
wide-area applications include their own storage lay- 
ers [4, 14, 19, 31] or adapt an existing system [29, 40]. 
Unfortunately, most existing storage systems, even more 
general ones like OceanStore/Pond [30] or S3 [1], are only 
suitable for a limited range of applications and still require 
a large amount of code to use. DHTs are a popular form 
of general wide-area storage, but, while DHTs all offer 
a similar interface, they differ widely in implementation. 
For example, UsenetDHT [36] and CoralCDN [19] both 
use a DHT, but their DHTs differ in many details and are 
not interchangeable. 

Some wide-area storage systems offer configuration 
options in order to make them suitable for a larger range of 
applications. Amazon’s Dynamo [16] works across multi- 
ple data centers and provides developers with two knobs: 
the number of replicas to read or to write, in order to con- 
trol durability, availability and consistency tradeoffs. By 


contrast, WheelFS’s cues are at a higher level (e.g., even- 
tual consistency versus close-to-open consistency). Total 
Recall [10] offers a per-object flexible storage API and 
uses a primary/backup architecture like WheelFS, but as- 
sumes no network partitions, focuses mostly on availabil- 
ity controls, and targets a more dynamic environment. 
Bayou [39] and Pangaea [32] provide eventual consis- 
tency by default while the latter also allows the use of a 
“red button” to wait for the acknowledgment of updates 
from all replicas explicitly. Like Pangaea and Dynamo, 
WheelFS provides flexible consistency tradeoffs. Addi- 
tionally, WheelFS also provides controls in other cate- 
gories (such as data placement, large reads) to suit the 
needs of a variety of applications. 


10 Conclusion 


Applications that distribute data across multiple sites have 
varied consistency, durability, and availability needs. A 
shared storage system able to meet this diverse set of 
needs would ideally provide applications a flexible and 
practical interface, and handle applications’ storage needs 
without sacrificing much performance when compared to 
a specialized solution. This paper describes WheelFS, a 
wide-area storage system with a traditional POSIX inter- 
face augmented by cues that allow distributed applications 
to control consistency and fault-tolerance tradeoffs. 

WheelFS offers a small set of cues in four categories 
(placement, durability, consistency, and large reads), 
which we have found to work well for many common dis- 
tributed workloads. We have used a WheelFS prototype 
as a building block in a variety of distributed applications, 
and evaluation results show that it meets the needs of 
these applications while permitting significant code reuse 
of their existing, non-distributed counterparts. We hope to 
make an implementation of WheelFS available to developers
in the near future.


Abstract


This paper presents PADS, a policy architecture for build- 
ing distributed storage systems. A policy architecture has 
two aspects. First, a common set of mechanisms that al- 
low new systems to be implemented simply by defining 
new policies. Second, a structure for how policies, them- 
selves, should be specified. In the case of distributed 
storage systems, PADS defines a data plane that pro- 
vides a fixed set of mechanisms for storing and trans- 
mitting data and maintaining consistency information. 
PADS requires a designer to define a control plane pol- 
icy that specifies the system-specific policy for orches- 
trating flows of data among nodes. PADS then divides 
control plane policy into two parts: routing policy and 
blocking policy. The PADS prototype defines a concise 
interface between the data and control planes, it provides 
a declarative language for specifying routing policy, and 
it defines a simple interface for specifying blocking pol- 
icy. We find that PADS greatly reduces the effort to de- 
sign, implement, and modify distributed storage systems. 
In particular, by using PADS we were able to quickly 
construct a dozen significant distributed storage systems 
spanning a large portion of the design space using just a 
few dozen policy rules to define each system. 


1 Introduction 


Our goal is to make it easy for system designers to con- 
struct new distributed storage systems. Distributed stor- 
age systems need to deal with a wide range of hetero- 
geneity in terms of devices with diverse capabilities (e.g., 
phones, set-top-boxes, laptops, servers), workloads (e.g., 
streaming media, interactive web services, private stor- 
age, widespread sharing, demand caching, preloading), 
connectivity (e.g., wired, wireless, disruption tolerant), 
and environments (e.g., mobile networks, wide area net- 
works, developing regions). To cope with these varying 
demands, new systems are developed [12, 14, 19, 21, 
22, 30], each making design choices that balance perfor- 
mance, resource usage, consistency, and availability. Be- 
cause these tradeoffs are fundamental [7, 16, 34], we do 
not expect the emergence of a single “hero” distributed 
storage system to serve all situations and end the need 
for new systems. 

This paper presents PADS, a policy architecture that 


simplifies the development of distributed storage sys- 
tems. A policy architecture has two aspects. 

First, a policy architecture defines a common set of 
mechanisms and allows new systems to be implemented 
simply by defining new policies. PADS casts its mech- 
anisms as part of a data plane and policies as part of a 
control plane. The data plane encapsulates a set of com- 
mon mechanisms that handle the details of storing and 
transmitting data and maintaining consistency informa- 
tion. System designers then build storage systems by 
specifying a control plane policy that orchestrates data 
flows among nodes. 

Second, a policy architecture defines a framework for 
specifying policy. In PADS, we separate control plane 
policy into routing and blocking policy. 


• Routing policy: Many of the design choices of distributed
storage systems are simply routing decisions
about data flows between nodes. These decisions provide
answers to questions such as: “When and where
to send updates?” or “Which node to contact on a
read miss?”, and they largely determine how a system
meets its performance, availability, and resource
consumption goals.


• Blocking policy: Blocking policy specifies predicates
for when nodes must block incoming updates or local
read/write requests to maintain system invariants.
Blocking is important for meeting consistency and
durability goals. For example, a policy might block
the completion of a write until the update reaches at
least 3 other nodes.
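
The blocking example in the list above (hold a write until the update has reached at least 3 other nodes) can be pictured as a predicate guarded by a condition variable. This is a minimal sketch of the idea only; the class and method names are hypothetical and do not reflect the PADS blocking interface, which the text has not yet described.

```python
import threading

class WriteBarrier:
    """Blocks completion of a write until enough other nodes have the update."""

    def __init__(self, required_acks=3):
        self.required_acks = required_acks
        self._acked = set()                    # node IDs that received the update
        self._cond = threading.Condition()

    def ack(self, node_id):
        """Record that `node_id` has received the update."""
        with self._cond:
            self._acked.add(node_id)
            self._cond.notify_all()

    def wait_complete(self, timeout=None):
        """Block until the predicate holds; return False if the timeout expires first."""
        with self._cond:
            return self._cond.wait_for(
                lambda: len(self._acked) >= self.required_acks, timeout=timeout
            )

barrier = WriteBarrier(required_acks=3)
barrier.ack("n1")
barrier.ack("n2")
print(barrier.wait_complete(timeout=0.01))  # False: only two nodes so far
barrier.ack("n3")
print(barrier.wait_complete(timeout=0.01))  # True
```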


The PADS prototype is an instantiation of this archi- 
tecture. It provides a concise interface between the con- 
trol and data planes that is flexible, efficient, and yet sim- 
ple. For routing policy, designers specify an event-driven 
program over an API comprising a set of actions that set 
up data flows, a set of triggers that expose local node in- 
formation, and the abstraction of stored events that store 
and retrieve persistent state. To facilitate the specifi- 
cation of event-driven routing, the prototype defines a 
domain-specific language that allows routing policy to 
be written as a set of declarative rules. For defining a 
control plane’s blocking policy, PADS defines five blocking
points in the data plane’s processing of read, write,
and receive-update actions; at each blocking point, a designer
specifies blocking predicates that indicate when
the processing of these actions must block.


[Figure caption fragment] ... the set of features covered by the implementation. *Note that the original implementations of some systems provide interfaces that
differ from the object store or file system interfaces we provide in our prototypes.


Ultimately, the evidence for PADS’s usefulness is simple:
two students used PADS to construct a dozen distributed
storage systems summarized in Figure 1 in a few
months. PADS’s ability to support these systems (1) pro- 
vides evidence supporting our high-level approach and 
(2) suggests that the specific APIs of our PADS prototype 
adequately capture the key abstractions for building dis- 
tributed storage systems. Notably, in contrast with the 
thousands of lines of code it typically takes to construct 
such a system using standard practice, given the PADS 
prototype it requires just 6-75 routing rules and a hand- 
ful of blocking conditions to define each new system with 
PADS. 


Similarly, we find it easy to add significant new 
features to PADS systems. For example, we add co- 
operative caching [5] to Coda by adding 13 rules. 


This flexibility comes at a modest cost to absolute per- 
formance. Microbenchmark performance of an imple- 
mentation of one system (P-Coda) built on our user-level 
Java PADS prototype is within ten to fifty percent of the 
original system (Coda [14]) in most cases and 3.3 times 
worse in the worst case we measured. 


A key issue in interpreting Figure 1 is understanding
how complete or realistic these PADS implementations 
are. The PADS implementations are not bug-compatible 
recreations of every detail of the original systems, but we
believe they do capture the overall architecture of these
designs by storing approximately the same data on each 
node, by sending approximately the same data across the 
same network links, and by enforcing the same consis- 
tency and durability semantics; we discuss our definition 
of architectural equivalence in Section 6. We also note 
that our PADS implementations are sufficiently complete 
to run file system benchmarks and that they handle im- 
portant and challenging real world details like configura- 
tion files and crash recovery. 


2 PADS overview 


Separating mechanism from policy is an old idea. As 
Figure 2 illustrates, PADS does so by defining a data 
plane that embodies the basic mechanisms needed for 
storing data, sending and receiving data, and maintain- 
ing consistency information. PADS then casts policy 
as defining a control plane that orchestrates data flow 
among nodes. This division is useful because it allows 
the designer to focus on high-level specification of con- 
trol plane policy rather than on implementation of low- 
level data storage, bookkeeping, and transmission de- 
tails. 

PADS must therefore specify an interface between the 
data plane and the control plane that is flexible and effi- 
cient so that it can accommodate a wide design space. At 
the same time, the interface must be simple so that the 
designer can reason about it. Section 3 and Section 4 de- 
tail the interface exposed by the data plane mechanisms 
to the control plane policy. 

To meet these goals and to guide a designer, PADS divides
the control policy into a routing policy and a blocking
policy. This division is useful because it introduces a
separation of concerns for a system designer.

First, a system’s trade-offs among performance, avail- 
ability, and resource consumption goals largely map to 
routing rules. For example, sending all updates to all 
nodes provides excellent response time and availability, 
whereas caching data on demand requires fewer network 
and storage resources. As described in Section 3, a PADS 
routing policy is an event-driven program that builds on 
the data plane mechanisms exposed by the PADS API to 
set up data flows among nodes in order to transmit and 
store the desired data at the desired nodes. 

Second, a system’s durability and consistency con- 
straints are naturally expressed as conditions that must 
be met when an object is read or updated. For example, 
the enforcement of a specific consistency semantic might 
require a read to block until it can return the value of 
the most recently completed write. As described in
Section 4, a PADS blocking policy specifies these
requirements as a set of predicates that block access to an
object until the predicates are satisfied.

Blocking policy works together with routing policy to
enforce the safety constraints and the liveness goals of
a system. Blocking policy enforces safety conditions by
ensuring that an operation blocks until system invariants
are met, whereas routing policy guarantees liveness by
setting up data flows so that the conditions are eventually
satisfied and the operation eventually unblocks.
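
The division of labor described above can be illustrated with a minimal Python sketch. It is not the PADS API (the prototype is written in Java); the class `Node` and its methods are hypothetical names, and the "server" is just a dictionary. The blocking predicate refuses to return data the node does not hold, and the routing handler fetches the data so the read can complete.

```python
class Node:
    """Toy node: a read is safe only when the object's body is held locally."""

    def __init__(self, server_store):
        self.local = {}                   # object id -> body held locally
        self.server_store = server_store  # dict standing in for a remote server

    def predicate_ok(self, obj):
        # Blocking policy (safety): never return data we do not hold.
        return obj in self.local

    def on_read_blocked(self, obj):
        # Routing policy (liveness): fetch the data so the read can unblock.
        self.local[obj] = self.server_store[obj]

    def read(self, obj):
        if not self.predicate_ok(obj):
            self.on_read_blocked(obj)     # stands in for an "operation block" trigger
        return self.local[obj]
```

In this toy version the fetch is synchronous; in a real deployment the read would wait while the routing policy sets up a data flow.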


2.1 Using PADS 


As Figure 2 illustrates, in order to build a distributed stor- 
age system on PADS, a system designer writes a routing 
policy and a blocking policy. She writes the routing pol- 
icy as an event-driven program comprising a set of rules 
that send or fetch updates among nodes when particular 
events exposed by the underlying data plane occur. She 
writes her blocking policy as a list of predicates. She 
then uses a PADS compiler to translate her routing rules
into Java and places the blocking predicates in a
configuration file. Finally, she distributes a Java jar file
containing PADS’s standard data plane mechanisms and her
system’s control policy to the system’s nodes. Once the 
system is running at each node, users can access locally 
stored data, and the system synchronizes data among 
nodes according to the policy. 


2.2 Policies vs. goals 


A PADS policy is a specific set of directives rather than 
a statement of a system’s high-level goals. Distributed 
storage design is a creative process and PADS does not 
attempt to automate it: a designer must still devise a 
strategy to resolve trade-offs among factors like perfor- 
mance, availability, resource consumption, consistency, 
and durability. For example, a policy designer might de- 
cide on a client-server architecture and specify “When 
an update occurs at a client, the client should send the 
update to the server within 30 seconds” rather than stat- 
ing “Machine X has highly durable storage” and “Data 
should be durable within 30 seconds of its creation” and 
then relying on the system to derive a client-server archi- 
tecture with a 30 second write buffer. 


2.3 Scope and limitations


PADS targets distributed storage environments with mo- 
bile devices, nodes connected by WAN networks, or 
nodes in developing regions with limited or intermittent 
connectivity. In these environments, factors like limited 
bandwidth, heterogeneous device capabilities, network 
partitions, or workload properties force interesting trade- 
offs among data placement, update propagation, and con- 
sistency. Conversely, we do not target environments like 
well-connected clusters. 

Within this scope, there are three design issues for 
which the current PADS prototype significantly restricts 
a designer’s choices.

First, the prototype does not support security specifi- 
cation. Ultimately, our policy architecture should also 
define flexible security primitives, and providing such 
primitives is important future work [18]. 

Second, the prototype exposes an object-store interface
for local reads and writes. It does not expose other
interfaces such as a file system or a tuple store. We be- 
lieve that these interfaces are not difficult to incorporate. 
Indeed, we have implemented an NFS interface over our 
prototype. 

Third, the prototype provides a single mechanism for 
conflict resolution. Write-write conflicts are detected and 
logged in a way that is data-preserving and consistent 
across nodes to support a broad range of application- 
level resolvers. We implement a simple last writer wins 
resolution scheme and believe that it is straightforward to 
extend PADS to support other schemes [14, 31, 13, 28, 6]. 


3 Routing policy 


In PADS, the basic abstraction provided by the data plane 
is a subscription—a unidirectional stream of updates to 
a specific subset of objects between a pair of nodes. A 
policy designer controls the data plane’s subscriptions to 
implement the system’s routing policy. For example, if 
a designer wants to implement hierarchical caching, the 
routing policy would set up subscriptions among nodes 
to send updates up and to fetch data down the hierarchy. 
If a designer wants nodes to randomly gossip updates, 
the routing policy would set up subscriptions between 
random nodes. If a designer wants mobile nodes to ex- 
change updates when they are in communication range, 
the routing policy would probe for available neighbors 
and set up subscriptions at opportune times. 
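
As a rough illustration of the hierarchical caching example, the following Python sketch models a subscription as a small record and builds the subscriptions for a tree. The `Subscription` fields and the helper `hierarchical_subscriptions` are illustrative names chosen here, not part of the PADS interface.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Subscription:
    src: str      # node that sends updates
    dst: str      # node that receives updates
    obj_set: str  # objects covered, e.g. "/a/b/*"
    kind: str     # "inval" or "body"


def hierarchical_subscriptions(parent_of, obj_set="/*"):
    """Build subscriptions that send updates up and fetch data down a tree.

    parent_of maps each child node id to its parent node id.
    """
    subs = []
    for child, parent in parent_of.items():
        for kind in ("inval", "body"):
            subs.append(Subscription(child, parent, obj_set, kind))  # up
            subs.append(Subscription(parent, child, obj_set, kind))  # down
    return subs
```

A gossip policy would differ only in how the node pairs are chosen, which is the sense in which one abstraction covers several designs.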

Given this basic approach, the challenge is to define 
an API that is sufficiently expressive to construct a wide 
range of systems and yet sufficiently simple to be com- 
prehensible to a designer. As the rest of this section de- 
tails, PADS provides three sets of primitives for specify- 
ing routing policies: (1) a set of 7 actions that establish 
or remove subscriptions to direct communication of spe- 
cific subsets of data among nodes, (2) a set of 9 triggers 
that expose the status of local operations and informa- 
tion flow, and (3) a set of 5 stored events that allow a 
routing policy to persistently store and access configura- 
tion options and information affecting routing decisions 
in data objects. Consequently, a system’s routing policy 
is specified as an event-driven program that invokes the 
appropriate actions or accesses stored events based on 
the triggers received. 

In the rest of this section, we discuss details of these 
PADS primitives and try to provide an intuition for why 
these few primitives can cover a large part of the design 
space. We do not claim that these primitives are minimal 
or that they are the only way to realize this approach. 
However, they have worked well for us in practice. 


3.1 Actions 


Routing Actions
  Add Inval Sub    srcId, destId, objS, [startTime], ...
  [remaining rows of the table are not recoverable]

Fig. 3: Routing actions provided by PADS. objId, off, and len
indicate the object identifier, offset, and length of the update
to be sent. startTime specifies the logical start time of the
subscription. writerId and time indicate the logical time of a
particular update. The fields for the B Action are policy defined.


The basic abstraction provided by a PADS action is simple:
an action sets up a subscription to route updates from one
node to another or removes an established subscription to
stop sending updates. As Figure 3 shows, the subscription
establishment API (Add Inval Sub and Add Body Sub) provides
five parameters that allow a designer to control the scope
of subscriptions:


• Selecting the subscription type. The designer decides
whether invalidations or bodies of updates should be 
sent. Every update comprises an invalidation and a 
body. An invalidation indicates that an update of a 
particular object occurred at a particular instant in log- 
ical time. Invalidations aid consistency enforcement 
by providing a means to quickly notify nodes of up- 
dates and to order the system’s events. Conversely, a 
body contains the data for a specific update. 


• Selecting the source and destination nodes. Since
subscriptions are unidirectional streams, the designer
indicates the direction of the subscription by specifying
the source node (srcId) of the updates and the destination
node (destId) to which the updates should be transmitted.


• Selecting what data to send. The designer specifies
what data to send by specifying the objects of interest
for a subscription so that only updates for those
objects are sent on the subscription. PADS exports a
hierarchical namespace in which objects are identified
with unique strings (e.g., /x/y/z) and a group of related
objects can be concisely specified (e.g., /a/b/*).


• Selecting the logical start time. The designer specifies
a logical start time so that the subscription can send 
all updates that have occurred to the objects of interest 
from that time. The start time is specified as a partial 
version vector and is set by default to the receiver’s 
current logical time. 


• Selecting the catch-up method. If the start time for
an invalidation subscription is earlier than the sender’s
current logical time, the sender has two options: The
sender can transmit either a log of the updates that
have occurred since the start time or a checkpoint that
includes just the most recent update to each byterange
since the start time. These options have different
performance tradeoffs. Sending a log is more efficient
when the number of recent changes is small compared
to the number of objects covered by the subscription.
Conversely, a checkpoint is more efficient if (a) the
start time is in the distant past (so the log of events is
long) or (b) the subscription set consists of only a few
objects (so the size of the checkpoint is small). Note
that once a subscription catches up with the sender’s
current logical time, updates are sent as they arrive,
effectively putting all active subscriptions into a mode
of continuous, incremental log transfer. For body
subscriptions, if the start time of the subscription is earlier
than the sender’s current time, the sender transmits a
checkpoint containing the most recent update to each
byterange. The log option is not available for sending
bodies. Consequently, the data plane only needs to
store the most recent version of each byterange.


Local Read/Write Triggers
  Operation block          obj, off, len, blocking_point, failed_predicates
  [name not recoverable]   obj, off, len, writerId, time
  [name not recoverable]   obj, writerId, time
Message Arrival Triggers
  [name not recoverable]   srcId, obj, off, len, writerId, time
  Send body success        srcId, obj, off, len, writerId, time
  Send body failed         srcId, destId, obj, off, len, writerId, time
Connection Triggers
  [name not recoverable]   srcId, destId, objS, Inval | Body
  Subscription caught-up   srcId, destId, objS, Inval
  Subscription end         srcId, destId, objS, Reason, Inval | Body

Fig. 4: Routing triggers provided by PADS. blocking_point and
failed_predicates indicate at which point an operation blocked
and what predicate failed (refer to Section 4). Inval | Body
indicate the type of subscription. Reason indicates if the
subscription ended due to failure or termination.
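
The object-set and catch-up parameters above can be sketched in a few lines of Python. This is a simplification under stated assumptions: logical time is a single integer rather than a partial version vector, updates cover whole objects rather than byteranges, "since the start time" is taken to mean strictly after it, and the function names `in_obj_set` and `catch_up` are hypothetical.

```python
def in_obj_set(obj_id, obj_set):
    """True if obj_id is named by obj_set; a trailing '/*' covers a subtree."""
    if obj_set.endswith("/*"):
        return obj_id.startswith(obj_set[:-1])
    return obj_id == obj_set


def catch_up(updates, obj_set, start_time, mode):
    """Select what a sender transmits to bring a subscription up to date.

    updates: list of (time, obj_id, value) tuples in logical-time order.
    mode: "log" sends every matching update, "checkpoint" only the newest
    matching update per object.
    """
    relevant = [u for u in updates
                if u[0] > start_time and in_obj_set(u[1], obj_set)]
    if mode == "log":
        return relevant
    latest = {}
    for u in relevant:          # later updates overwrite earlier ones
        latest[u[1]] = u
    return sorted(latest.values())
```

With many updates to few objects the checkpoint is the shorter transfer, and with few updates to many objects the log is, which matches the tradeoff described in the text.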


In addition to the interface for creating subscriptions 
(Add Inval Sub and Add Body Sub), PADS provides Re- 
move Inval Sub and Remove Body Sub to remove estab- 
lished subscriptions, Send Body to send an individual 
body of an update that occurred at or after the specified
time, Assign Seq to mark a previous update with a
commit sequence number to aid enforcement of consis- 
tency [23], and B Action to allow the routing policy to 
send an event to the blocking policy (refer to Section 4). 
Figure 3 details the full routing actions API. 


3.2 Triggers 


PADS triggers expose to the control plane policy events 
that occur in the data plane. As Figure 4 details, these 
events fall into three categories. 


• Local operation triggers inform the routing policy
when an operation blocks because it needs additional
information to complete or when a local write or delete
occurs.


• Message receipt triggers inform the routing policy
when an invalidation arrives, when a body arrives, or
whether a send body succeeds or fails.


• Connection triggers inform the routing policy when
subscriptions are successfully established, when a
subscription has caused a receiver’s state to be caught up
with a sender’s state (i.e., the subscription has
transmitted all updates to the subscription set up to the
sender’s current time), or when a subscription is
removed or fails.


Fig. 5: PADS’s stored events interface. objId specifies the
object in which the events should be stored or read from.
eventName defines the name of the event to be written and field*
specify the values of fields associated with it.
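
To make the trigger model concrete, here is a small Python sketch of a policy that registers handlers for triggers and records the actions they return. The dispatcher class `RoutingPolicy`, the trigger name string, and the tuple encoding of actions are assumptions made for illustration; the field names follow the Subscription end row of Figure 4.

```python
from collections import defaultdict


class RoutingPolicy:
    """Dispatches data-plane triggers to handlers registered by name."""

    def __init__(self):
        self.handlers = defaultdict(list)
        self.actions = []            # actions issued by the policy so far

    def on(self, trigger_name, handler):
        self.handlers[trigger_name].append(handler)

    def fire(self, trigger_name, **fields):
        for handler in self.handlers[trigger_name]:
            action = handler(**fields)
            if action is not None:
                self.actions.append(action)


def retry_on_failure(srcId, destId, objS, reason, kind):
    # Re-add a subscription that ended because of a failure.
    if reason == "failure":
        return ("add_%s_sub" % kind, srcId, destId, objS)
    return None
```

A handler that returns `None` issues no action, so a subscription that ended through normal termination is left alone.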


3.3 Stored events


Many systems need to maintain persistent state to make 
routing decisions. Supporting this need is challenging 
both because we want an abstraction that meshes well 
with our event-driven programming model and because 
the techniques must handle a wide range of scales. In 
particular, the abstraction must not only handle simple, 
global configuration information (e.g., the server identity 
in a client-server system like Coda [14]), but it must also 
scale up to per-file information (e.g., which nodes store 
the gold copies of each object in Pangaea [26]).

To provide a uniform abstraction to address this range 
of demands, PADS provides stored events primitives to 
store events into a data object in the underlying persis- 
tent object store. Figure 5 details the full API for stored 
events. A Write Event stores an event into an object and 
a Read Event causes all events stored in an object to be 
fed as input to the routing program. The API also in- 
cludes Read and Watch to produce new events whenever 
they are added to an object, Stop Watch to stop producing 
new events from an object, and Delete Events to delete all 
events in an object. 

For example, in a hierarchical information dissemi- 
nation system, a parent p keeps track of what volumes 
a child subscribes to so that the appropriate subscrip- 
tions can be set up. When a child c subscribes to a new 
volume v, p stores the information in a configuration 
object /subInfo by generating a <write_event, /subInfo, 
child_sub, p, c, v> action. When this information is 
needed, for example on startup or recovery, the parent 
generates a <read_event, /subInfo> action that causes a 
<child_sub, p, c, v> event to be generated for each item 
stored in the object. The child_sub events, in turn, trig- 
ger event handlers in the routing policy that re-establish 
subscriptions.
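
The stored events primitives can be approximated with an in-memory Python sketch. The class `StoredEvents` is a stand-in written for this example: it keeps events in a dictionary rather than in a persistent object store, and it assumes that Read and Watch first replays existing events and then delivers new ones, which the text does not state explicitly.

```python
class StoredEvents:
    """In-memory stand-in for events stored inside data objects."""

    def __init__(self):
        self.objects = {}    # object id -> list of (event_name, fields)
        self.watchers = {}   # object id -> callback for newly written events

    def write_event(self, obj_id, event_name, *fields):
        event = (event_name, fields)
        self.objects.setdefault(obj_id, []).append(event)
        if obj_id in self.watchers:
            self.watchers[obj_id](event)

    def read_event(self, obj_id, deliver):
        # Feed every stored event to the routing program.
        for event in self.objects.get(obj_id, []):
            deliver(event)

    def read_and_watch(self, obj_id, deliver):
        self.read_event(obj_id, deliver)   # assumption: replay, then watch
        self.watchers[obj_id] = deliver

    def stop_watch(self, obj_id):
        self.watchers.pop(obj_id, None)

    def delete_events(self, obj_id):
        self.objects.pop(obj_id, None)
```

In the hierarchical dissemination example, the parent would call `write_event("/subInfo", "child_sub", p, c, v)` when a child subscribes and `read_event("/subInfo", handler)` on startup or recovery.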

3.4 Specifying routing policy 

A routing policy is specified as an event-driven program 
that invokes actions when local triggers or stored events 
are received. PADS provides R/OverLog, a language
based on the OverLog routing language [17] and a runtime
to simplify writing event-driven policies.¹

As in OverLog, an R/OverLog program defines a set of
tables and a set of rules. Tables store tuples that represent 
internal state of the routing program. This state does not 
need to be persistently stored, but is required for policy 
execution and can dynamically change. For example, a 
table might store the ids of currently reachable nodes. 
Rules are fired when an event occurs and the constraints 
associated with the rule are met. The input event to a 
rule can be a trigger injected from the local data plane, 
a stored event injected from the data plane’s persistent 
state, or an internal event produced by another rule on a 
local machine or a remote machine. Every rule generates 
a single event that invokes an action in the data plane, 
fires another local or remote rule, or is stored in a table 
as a tuple. For example, the following rule: 

    EVT_clientReadMiss(@S, X, Obj, Off, Len) :-
        TRIG_operationBlock(@X, Obj, Off, Len, BPoint, _),
        TBL_serverId(@X, S),
        BPoint == "readNowBlock".


specifies that whenever node X receives an operationBlock
trigger informing it of an operation blocked at the
readNowBlock blocking point, it should produce a new event
clientReadMiss at server S, identified by the serverId table.
This event is populated with the fields from the triggering
event and the constraints: the client id (X), the data to be
read (obj, off, len), and the server to contact (S). Note that
the underscore symbol (_) is a wildcard that matches any
list of predicates and the at symbol (@) specifies the node
at which the event occurs. A more complete discussion
of the OverLog language and execution model is available
elsewhere [17].
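
For readers unfamiliar with OverLog-style rules, the logic of the rule above can be restated as an ordinary Python function. This is only a paraphrase for intuition, not how the R/OverLog runtime evaluates rules; the function name and the tuple encodings of the trigger and the resulting event are chosen here for illustration.

```python
def client_read_miss_rule(trigger, server_id_table):
    """Python paraphrase of the EVT_clientReadMiss rule.

    trigger: (node, obj, off, length, blocking_point, failed_predicates),
             mirroring the fields of TRIG_operationBlock.
    server_id_table: dict mapping a node X to its server S (TBL_serverId).
    Returns the event to deliver at S, or None if the rule does not fire.
    """
    x, obj, off, length, bpoint, _ = trigger   # "_" ignores the last field
    if bpoint != "readNowBlock":               # constraint on BPoint not met
        return None
    if x not in server_id_table:               # no matching table tuple
        return None
    s = server_id_table[x]
    return ("clientReadMiss", s, x, obj, off, length)
```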


4 Blocking policy 


A system’s durability and consistency constraints can be 
naturally expressed as invariants that must hold when an 
object is accessed. In PADS, the system designer speci- 
fies these invariants as a set of predicates that block ac- 
cess to an object until the conditions are satisfied. To that 
end, PADS (1) defines 5 blocking points for which a sys- 
tem designer specifies predicates, (2) provides 4 built-in 
conditions that a designer can use as predicates, and (3) 
exposes a B_Action interface that allows a designer to 
specify custom conditions based on routing information. 


¹Note that if learning a domain specific language is not one’s cup of
tea, one can define a (less succinct) policy by writing Java handlers for 
PADS triggers and stored events to generate PADS actions and stored 
events. 
isValid              Block until node has received the body
                     corresponding to the highest received
                     invalidation for the target object

isComplete           Block until object’s consistency state
                     reflects all updates before the node’s
                     current logical time

isSequenced          Block until object’s total order is
                     established

maxStaleness         Block until all writes up to
  nodes, count, t    (operationStartTime-t) from count nodes in
                     nodes have been received.

User Defined Conditions on Local or Distributed State
B_Action             Block until an event with fields matching
  event-spec         event-spec is received from routing policy


Fig. 6: Conditions available for defining blocking predicates.





The set of predicates for each blocking point makes up 
the blocking policy of the system. 


4.1 Blocking points 


PADS defines five points for which a policy can supply a 
predicate and a timeout value to block a request until the 
predicate is satisfied or the timeout is reached. The first 
three are the most important: 


• ReadNowBlock blocks a read until it will return data
from a moment that satisfies the predicate. Blocking
here is useful for ensuring consistency (e.g., block
until a read is guaranteed to return the latest sequenced
write).


• WriteEndBlock blocks a write request after it has
updated the local object but before it returns. Blocking
here is useful for ensuring consistency (e.g., block
until all previous versions of this data are invalidated)
and durability (e.g., block here until the update is
stored at the server).


• ApplyUpdateBlock blocks an invalidation received
from the network before it is applied to the local data
object. Blocking here is useful to increase data
availability by allowing a node to continue serving local
data, which it might not have been able to do if the data
had been invalidated (e.g., block applying a received
invalidation until the corresponding body is received).


PADS also provides WriteBeforeBlock to block a write 
before it modifies the underlying data object and Read- 
EndBlock to block a read after it has retrieved data from 
the data plane but before it returns. 
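
A blocking point with a predicate and a timeout maps naturally onto a condition variable. The Python sketch below shows one way this could look; it is not taken from the PADS implementation, and the class name `BlockingPoint` and its methods are hypothetical.

```python
import threading


class BlockingPoint:
    """Blocks a caller until a predicate holds or a timeout expires."""

    def __init__(self, predicate, timeout):
        self.predicate = predicate   # zero-argument callable returning bool
        self.timeout = timeout       # seconds to wait before giving up
        self.cond = threading.Condition()

    def wait(self):
        # Returns True if the predicate was satisfied, False on timeout.
        with self.cond:
            return self.cond.wait_for(self.predicate, timeout=self.timeout)

    def state_changed(self):
        # Call whenever state read by the predicate may have changed.
        with self.cond:
            self.cond.notify_all()
```

A read path would call `wait()` at ReadNowBlock, and the code that applies incoming updates would call `state_changed()` so that waiting operations re-evaluate their predicates.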


4.2 Blocking conditions 


PADS provides a set of predefined conditions, listed in 
Figure 6, to specify predicates at each blocking point. 
A blocking predicate can use any combination of these
conditions. The first four conditions provide an interface
to the consistency bookkeeping information maintained 
in the data plane on each node. 


• isValid requires that the last body received for an
object is as new as the last invalidation received for that
object. isValid is useful for enforcing monotonic
coherence on reads² and for maximizing availability by
ensuring that invalidations received from other nodes
are not applied until they can be applied with their
corresponding bodies [6, 20].

• isComplete requires that a node receives all
invalidations for the target object up to the node’s current
logical time. isComplete is needed because liveness
policies can direct arbitrary subsets of invalidations to a
node, so a node may have gaps in its consistency state
for some objects. If the predicate for ReadNowBlock
is set to isValid and isComplete, reads are guaranteed
to see causal consistency.

• isSequenced requires that the most recent write to the
target object has been assigned a position in a total
order. Policies that want to ensure sequential or
stronger consistency can use the Assign Seq routing
action (see Figure 3) to allow a node to sequence other
nodes’ writes and specify the isSequenced condition
as a ReadNowBlock predicate to block reads of
unsequenced data.

• MaxStaleness is useful for bounding real time
staleness.
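
The first three built-in conditions can be pictured as simple checks over per-object bookkeeping. The Python sketch below assumes a much simplified state record (`ObjState`, a hypothetical name) in which logical times are plain integers; the actual data plane tracks richer consistency state.

```python
from dataclasses import dataclass


@dataclass
class ObjState:
    last_inval_time: int   # logical time of the newest invalidation received
    last_body_time: int    # logical time of the newest body received
    complete_upto: int     # all invalidations up to this time are known
    sequenced: bool        # newest write has a position in the total order


def is_valid(s):
    # The last body is as new as the last invalidation.
    return s.last_body_time >= s.last_inval_time


def is_complete(s, node_time):
    # No gaps in the object's invalidations up to the node's current time.
    return s.complete_upto >= node_time


def is_sequenced(s):
    return s.sequenced


def read_now_ok(s, node_time):
    # isValid and isComplete together, the combination the text associates
    # with causally consistent reads.
    return is_valid(s) and is_complete(s, node_time)
```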


The fifth condition on which a blocking predicate can
be based is B_Action. A B_Action condition provides
an interface with which a routing policy can signal an 
arbitrary condition to a blocking predicate. An operation 
waiting for event-spec unblocks when the routing rules 
produce an event whose fields match the specified spec. 


Rationale. The first four, built-in consistency book- 
keeping primitives exposed by this API were developed 
because they are simple and inexpensive to maintain 
within the data plane [2, 35] but they would be complex 
or expensive to maintain in the control plane. Note that 
they are primitives, not solutions. For example, to en- 
force linearizability, one must not only ensure that one 
reads only sequenced updates (e.g., via blocking at Read- 
NowBlock on isSequenced) but also that a write operation 
blocks until all prior versions of the object have been in- 
validated (e.g., via blocking at WriteEndBlock on, say, 
the B_Action allInvalidated which the routing policy pro- 
duces by tracking data propagation through the system). 
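
The linearizability example can be written down as a small table from blocking points to required conditions. The Python sketch below is only an illustration of that pairing: the dictionary layout, the string encoding of the B_Action condition, and the helper `may_proceed` are assumptions, and a complete policy would likely list further conditions.

```python
# Conditions named in the linearizability example, keyed by blocking point.
LINEARIZABLE_POLICY = {
    "ReadNowBlock": ["isSequenced"],
    "WriteEndBlock": ["B_Action:allInvalidated"],
}


def may_proceed(point, satisfied, policy=LINEARIZABLE_POLICY):
    """An operation passes a blocking point once all its conditions hold.

    satisfied: set of condition names currently true for the operation.
    Blocking points with no listed conditions never block.
    """
    return all(cond in satisfied for cond in policy.get(point, []))
```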

Beyond the four pre-defined conditions, a policy-defined
B_Action condition is needed for two reasons.
The most obvious need is to avoid having to predefine
all possible interesting conditions. The other reason for
allowing conditions to be met by actions from the
event-driven routing policy is that when conditions reflect
distributed state, policy designers can exploit knowledge of
their system to produce better solutions than a generic
implementation of the same condition. For example, in
the client-server system we describe in Section 6, a client
blocks a write until it is sure that all other clients caching
the object have been invalidated. A generic implementation
of the condition might have required the client
that issued the write to contact all other clients. However,
a policy-defined event can take advantage of the
client-server topology for a more efficient implementation.
The client sets the WriteEndBlock predicate to a
policy-defined receivedAllAcks event. Then, when an object
is written and other clients receive an invalidation,
they send acknowledgements to the server. When the
server gathers acknowledgements from all other clients,
it generates a receivedAllAcks action for the client that
issued the write.


² Any read on an object will return a version that is equal to or newer
than the version that was last read.
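
The acknowledgement-gathering step in the client-server example can be sketched as server-side bookkeeping for a single write. The Python class below is hypothetical (`AckTracker` and its fields are names invented here); it records the receivedAllAcks signal in a list instead of sending it over the network.

```python
class AckTracker:
    """Server-side bookkeeping for one write in the client-server example."""

    def __init__(self, writer, caching_clients):
        self.writer = writer
        # Clients other than the writer that must acknowledge the invalidation.
        self.pending = set(caching_clients) - {writer}
        self.signals = []   # B_Action-style events destined for the writer
        self._maybe_signal()

    def on_ack(self, client):
        self.pending.discard(client)
        self._maybe_signal()

    def _maybe_signal(self):
        # Signal exactly once, when no acknowledgements remain outstanding.
        if not self.pending and not self.signals:
            self.signals.append(("receivedAllAcks", self.writer))
```

The writer contacts only the server, and the server contacts the other clients, which is how the policy exploits the topology.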


5 Constructing P-TierStore 


As an example of how to build a system with PADS, we 
describe our implementation of P-TierStore, a system in- 
spired by TierStore [6]. We choose this example because 
it is simple and yet exercises most aspects of PADS. 


5.1 System goals 


TierStore is a distributed object storage system that tar- 
gets developing regions where networks are bandwidth- 
constrained and unreliable. Each node reads and writes 
specific subsets of the data. Since nodes must often op- 
erate in disconnected mode, the system prioritizes 100% 
availability over strong consistency. 


5.2 System design 


In order to achieve these goals, TierStore employs a hi- 
erarchical publish/subscribe system. All nodes are ar- 
ranged in a tree. To propagate updates up the tree, every 
node sends all of its updates and its children’s updates 
to its parent. To flood data down the tree, data are parti- 
tioned into “publications” and every node subscribes to a 
set of publications from its parent node covering its own 
interests and those of its children. For consistency, Tier- 
Store only supports single-object monotonic reads coher- 
ence. 


5.3 Policy specification


In order to construct P-TierStore, we decompose the de- 
sign into routing policy and blocking policy. 

A 14-rule routing policy establishes and maintains the 
publication aggregation and multicast trees. A full list- 
ing of these rules is available elsewhere [3]. In terms 
of PADS primitives, each connection in the tree is sim- 
ply an invalidation subscription and a body subscription 
between a pair of nodes. Every PADS node stores in con- 
figuration objects the ID of its parent and the set of pub- 
lications to subscribe to. 

On start up, a node uses stored events to read the con- 
figuration objects and store the configuration information 
in R/OverLog tables (4 rules). When it learns the ID
of its parent, it adds subscriptions for every item in the
publication set (2 rules). For every child, it adds
subscriptions for “/*” to receive all updates from the child
(2 rules). If an application decides to subscribe to an- 
other publication, it simply writes to the configuration 
object. When this update occurs, a new stored event is 
generated and the routing rules add subscriptions for the 
new publication. 
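
The subscriptions that one P-TierStore node sets up can be listed with a short Python function. This is a sketch of the description above rather than a translation of the actual routing rules; the function name and the (src, dst, objS, kind) tuple layout are assumptions.

```python
def tierstore_subscriptions(node, parent, children, publications):
    """Return (src, dst, obj_set, kind) tuples for one P-TierStore node.

    parent is None for the root of the tree.
    """
    subs = []
    for kind in ("inval", "body"):
        if parent is not None:
            for pub in publications:        # data flowing down the tree
                subs.append((parent, node, pub, kind))
        for child in children:              # updates flowing up the tree
            subs.append((child, node, "/*", kind))
    return subs
```

Each tree edge yields one invalidation subscription and one body subscription per object set, matching the statement that a connection in the tree is a pair of subscriptions.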


Recovery. If an incoming or an outgoing subscription 
fails, the node periodically tries to re-establish the con- 
nection (2 rules). Crash recovery requires no extra pol- 
icy rules. When a node crashes and starts up, it sim- 
ply re-establishes the subscriptions using its local logical 
time as the subscription’s start time. The data plane’s 
subscription mechanisms automatically detect which up- 
dates the receiver is missing and send them. 
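
Detecting which updates a recovering receiver lacks can be illustrated with a version-vector comparison. The Python sketch below assumes each update carries a writer id and a per-writer counter; the function name and the tuple layout are chosen for this example and do not describe the data plane's internal format.

```python
def missing_updates(sender_log, receiver_vv):
    """Return the updates in sender_log that the receiver has not seen.

    sender_log: list of (writer_id, counter, obj_id) tuples.
    receiver_vv: dict mapping writer_id to the highest counter received,
                 i.e. the receiver's logical time used as the start time.
    """
    return [u for u in sender_log if u[1] > receiver_vv.get(u[0], 0)]
```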


Delay tolerant network (DTN) support. P-TierStore 
supports DTN environments by allowing one or more 
mobile PADS nodes to relay information between a par- 
ent and a child in a distribution tree. In this configura- 
tion, whenever a relay node arrives, a node subscribes to 
receive any new updates the relay node brings and pushes 
all new local updates for the parent or child subscription 
to the relay node (4 rules). 


Blocking policy. Blocking policy is simple because 
TierStore has weak consistency requirements. Since 
TierStore prefers stale available data to unavailable data, 
we set the ApplyUpdateBlock predicate to isValid to avoid applying
an invalidation until the corresponding body is received. 


TierStore vs. P-TierStore. Publications in TierStore 
are defined by a container name and depth to include all 
objects up to that depth from the root of the publication. 
However, since P-TierStore uses a name hierarchy to de- 
fine publications (e.g., /publication1/*), all objects under 
the directory tree become part of the subscription with no 
limit on depth. 
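
The difference in publication membership can be shown with two small Python predicates. How TierStore counts depth is not specified here, so the sketch assumes that depth 1 means objects directly inside the container; both function names are hypothetical.

```python
def in_tierstore_publication(obj_id, container, depth):
    """Depth-limited membership, as described for TierStore.

    Assumption: depth 1 covers objects directly inside the container.
    """
    prefix = container.rstrip("/") + "/"
    if not obj_id.startswith(prefix):
        return False
    levels = obj_id[len(prefix):].count("/") + 1
    return levels <= depth


def in_p_tierstore_publication(obj_id, container):
    """Prefix membership with no depth limit, as in P-TierStore."""
    return obj_id.startswith(container.rstrip("/") + "/")
```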

Also, as noted in Section 2.3, PADS provides a single 
conflict-resolution mechanism, which differs from that 
of TierStore in some details. Similarly, TierStore pro- 
vides native support for directory objects, while PADS 
supports a simple untyped object store interface. 


6 Experience and evaluation 


Our central thesis is that it is useful to design and build 
distributed storage systems by specifying a control plane 
comprising a routing policy and a blocking policy. There 
is no quantitative way to prove that this approach is good, 
so we base our evaluation on our experience using the 
PADS prototype. 

Figure 1 conveys the main result of this paper: using
PADS, a small team was able to construct a dozen signif- 
icant systems with a large number of features that cover 
a large part of the design space. PADS qualitatively
reduced the effort to build these systems and increased our
team’s capabilities: we do not believe a small team such 
as ours could have constructed anything approaching this 
range of systems without PADS. 

In the rest of this section, we elaborate on this ex- 
perience by first discussing the range of systems stud- 
ied, the development effort needed, and our debugging 
experience. We then explore the realism of the sys- 
tems we constructed by examining how PADS handles 
key system-building problems like configuration, consis- 
tency, and crash recovery. Finally, we examine the costs 
of PADS’s generality: what overheads do our PADS im- 
plementations pay compared to ideal or hand-crafted im- 
plementations? 


Approach and environment. The goal of PADS is to 
help people develop new systems. One way to evaluate 
PADS would be to construct a new system for a new de- 
manding environment and report on that experience. We 
choose a different approach—constructing a broad range 
of existing systems—for three reasons. First, a single 
system may not cover all of the design choices or test 
the limits of PADS. Second, it might not be clear how 
to generalize the experience from building one system to 
building others. Third, it might be difficult to disentangle 
the challenges of designing a new system for a new envi- 
ronment from the challenges of realizing a design using 
PADS. 

The PADS prototype uses PRACTI [2, 35] to provide 
the data plane mechanisms. We implement an R/OverLog
to Java compiler using the XTC toolkit [9]. Except where
noted, all experiments are carried out on machines with 
3GHz Intel Pentium IV Xeon processors, 1GB of mem- 
ory, and 1Gb/s Ethernet. Machines and network connec- 
tions are controlled via the Emulab software [33]. For 
software, we use Fedora Core 8, BEA JRockit JVM Ver- 
sion 27.4.0, and Berkeley DB Java Edition 3.2.23. 


6.1 System development on PADS 


This section describes the design space we have covered, 
how the agility of the resulting implementations makes 
them easy to adapt, the design effort needed to construct 
a system under PADS, and our experience debugging and 
analyzing our implementations. 


6.1.1 Flexibility 


We constructed systems chosen from the literature to
cover a large part of the design space. We refer to our
implementation of each system as P-system (e.g., P-Coda).
To provide a sense of the design space covered, we
provide a short summary of each system’s properties
below and in Figure 1.


Generic client-server. We construct a simple
client-server (P-SCS) and a full-featured client-server (P-FCS).
Objects are stored on the server, and clients cache the
data from the server on demand. Both systems implement
callbacks in which the server keeps track of which
clients are storing a valid version of an object and sends
invalidations to them whenever the object is updated.
The difference between P-SCS and P-FCS is that P-SCS
assumes full-object writes while P-FCS supports
partial-object writes and also implements leases and
cooperative caching. Leases [8] increase availability by allowing
a server to break a callback for unreachable clients. Co- 
operative caching [5] allows clients to retrieve data from 
a nearby client rather than from the server. Both P-SCS 
and P-FCS enforce sequential consistency semantics and 
ensure durability by making sure that the server always 
holds the body of the most recently completed write of 
each object. 


Coda [14]. Coda is a client-server system that supports 
mobile clients. P-Coda includes the client-server pro- 
tocol and the features described in Kistler et al.’s pa- 
per [14]. It does not include server replication features 
detailed in [27]. Our discussion focuses on P-Coda. P- 
Coda is similar to P-FCS—it implements callbacks and 
leases but not cooperative caching; also, it guarantees
open-to-close consistency instead of sequential
consistency. A key feature of Coda is its support for
disconnected operation: clients can access locally cached data
when they are offline and propagate offline updates to
the server on reconnection. Every client has a hoard list
that specifies objects to be periodically fetched from the
server.


TRIP [20]. TRIP is a distributed storage system for 
large-scale information dissemination: all updates occur 
at a server and all reads occur at clients. TRIP uses a 
self-tuning prefetch algorithm and delays applying inval- 
idations to a client’s locally cached data to maximize the 
amount of data that a client can serve from its local state. 
TRIP guarantees sequential consistency via a simple al- 
gorithm that exploits the constraint that all writes are car- 
ried out by a single server. 


TierStore [6]. TierStore is described in Section 5. 


Chain replication [32]. Chain replication is a server 
replication protocol that guarantees linearizability and 
high availability. All the nodes in the system are arranged 
in a chain. Updates occur at the head and are only con- 
sidered complete when they have reached the tail. 
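The head-to-tail update path described above can be illustrated with a small sketch. This is a toy model under stated assumptions, not P-Chain-Replication's policy: the `Chain` class, its `submit` and `step` methods, and the one-hop-per-step timing are all hypothetical names and simplifications introduced here.

```python
class Chain:
    """Toy model of a replication chain; index 0 is the head, the last index is the tail."""

    def __init__(self, length):
        self.stores = [{} for _ in range(length)]
        self.in_flight = []  # entries are [next_node_index, key, value]

    def submit(self, key, value):
        """Apply an update at the head and queue it for forwarding down the chain."""
        self.stores[0][key] = value
        self.in_flight.append([1, key, value])

    def step(self):
        """Forward every in-flight update one hop; return updates that reached the tail."""
        completed, remaining = [], []
        for idx, key, value in self.in_flight:
            if idx < len(self.stores):
                self.stores[idx][key] = value
                idx += 1
            if idx >= len(self.stores):
                completed.append((key, value))  # stored at the tail: now complete
            else:
                remaining.append([idx, key, value])
        self.in_flight = remaining
        return completed


chain = Chain(3)
chain.submit("x", 1)
first = chain.step()   # update has reached the middle node only
second = chain.step()  # update has reached the tail
```

In this model an update is visible at the head immediately but is reported as complete only by the `step` call that stores it at the tail.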


Bayou [23]. Bayou is a server-replication protocol that
focuses on peer-to-peer data sharing. Every node has a
local copy of all of the system’s data. From time to time,
a node picks a peer to exchange updates with via
anti-entropy sessions.

Footnote (open-to-close consistency): Whenever a client
opens a file, it always gets the latest version of the file
known to the server, and the server is not updated until
the file is closed.


Pangaea [26]. Pangaea is a peer-to-peer distributed
storage system for wide area networks. Pangaea main- 
tains a connected graph across replicas for each object, 
and it pushes updates along the graph edges. Pangaea 
maintains three gold replicas for every object to ensure 
data durability. 


Summary of design features. As Figure 1 further
details, these systems cover a wide range of design features
in a number of key dimensions. For example,

• Replication: full replication (Bayou, Chain Replication,
and TRIP), partial replication (Coda, Pangaea, P-FCS,
and TierStore), demand caching (Coda, Pangaea,
and P-FCS).


• Topology: structured topologies such as client-server
(Coda, P-FCS, and TRIP), hierarchical (TierStore),
and chain (Chain Replication); unstructured topologies
(Bayou and Pangaea). Invalidation-based (Coda
and P-FCS) and update-based (Bayou, TierStore, and
TRIP) propagation.


• Consistency: monotonic-reads coherence (Pangaea
and TierStore), causal (Bayou), sequential (P-FCS and
TRIP), and linearizability (Chain Replication);
techniques such as callbacks (Coda, P-FCS, and TRIP) and
leases (Coda and P-FCS).


• Availability: disconnected operation (Bayou, Coda,
TierStore, and TRIP), crash recovery (all), and
network reconnection (all).


Goal: Architectural equivalence. We build systems 
based on the above designs from the literature, but con- 
structing perfect, “bug-compatible” duplicates of the 
original systems using PADS is not a realistic (or
useful) goal. On the other hand, if we were free to pick and
choose arbitrary subsets of features to exclude, then the
bar for evaluating PADS would be too low: we could claim
to have built any system by simply excluding any features
PADS has difficulty supporting.

Section 2.3 identifies three aspects of system design— 
security, interface, and conflict resolution—for which 
PADS provides limited support, and our implementations 
of the above systems do not attempt to mimic the original 
designs in these dimensions. 

Beyond that, we have attempted to faithfully imple- 
ment the designs in the papers cited. More precisely, al- 
though our implementations certainly differ in some de- 
tails, we believe we have built systems that are archi- 
tecturally equivalent to the original designs. We define 
architectural equivalence in terms of three properties: 


E1. Equivalent overhead. A system’s network bandwidth
between any pair of nodes and its local storage at any
node are within a small constant factor of the target
system’s.


E2. Equivalent consistency. The system provides
consistency and staleness properties that are at least as strong
as the target system’s.


E3. Equivalent local data. The set of data that may be
accessed from the system’s local state without network
communication is a superset of the set of data that may
be accessed from the target system’s local state.
Notice that this property addresses several factors
including latency, availability, and durability.


There is a principled reason for believing that these prop- 
erties capture something about the essence of a repli- 
cation system: they highlight how a system resolves 
the fundamental CAP (Consistency vs. Availability vs. 
Partition-resilience) [7] and PC (Performance vs. Con- 
sistency) [16] trade-offs that any distributed storage sys- 
tem must make. 
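As an illustration of how the three properties could be checked mechanically, the following sketch compares a hypothetical measurement profile of an implementation against that of its target system. The `Profile` fields, the integer ordering of consistency levels, and the default factor of 2 are assumptions made for this example; the text only requires "a small constant factor" and does not prescribe a value.

```python
from dataclasses import dataclass


@dataclass
class Profile:
    """Hypothetical per-system measurements (field names are illustrative)."""
    bandwidth_bytes: int       # bytes sent between a given pair of nodes
    storage_bytes: int         # local storage used at a given node
    consistency_level: int     # larger means stronger (illustrative ordering)
    local_objects: frozenset   # objects readable without network communication


def architecturally_equivalent(impl, target, factor=2.0):
    """Return the (E1, E2, E3) results for impl measured against target."""
    e1 = (impl.bandwidth_bytes <= factor * target.bandwidth_bytes
          and impl.storage_bytes <= factor * target.storage_bytes)
    e2 = impl.consistency_level >= target.consistency_level
    e3 = impl.local_objects >= target.local_objects  # superset test
    return e1, e2, e3


target = Profile(1000, 5000, 2, frozenset({"/a", "/b"}))
impl = Profile(1500, 6000, 2, frozenset({"/a", "/b", "/c"}))
result = architecturally_equivalent(impl, target)  # (True, True, True)
```

A real comparison would apply the E1 check to every pair of nodes and every node rather than to a single measurement.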


6.1.2 Agility 


As workloads and goals change, a system’s requirements 
also change. We explore how systems built with PADS
can be adapted by adding new features. We highlight
two cases in particular: our implementations of Bayou
and Coda. Even though they are simple examples, they
demonstrate that being able to easily adapt a system to 
send the right data along the right paths can pay big div- 
idends. 


P-Bayou small device enhancement. P-Bayou is a 
server-replication protocol that exchanges updates be- 
tween pairs of servers via an anti-entropy protocol. Since 
the protocol propagates updates for the whole data set to 
every node, P-Bayou cannot efficiently support smaller 
devices that have limited storage or bandwidth. 

It is easy to change P-Bayou to support small devices. 
In the original P-Bayou design, when anti-entropy is trig- 
gered, a node connects to a reachable peer and subscribes 
to receive invalidations and bodies for all objects using a 
subscription set “/*”. In our small device variation, a
node uses stored events to read a list of directories from 
a per-node configuration file and subscribes only for the 
listed subdirectories. This change required us to modify 
two routing rules. 
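The difference between the two subscription choices can be sketched as follows. This is a minimal illustration in Python rather than R/OverLog; the function names and the prefix-matching rule are assumptions for this example, and only the “/*” subscription set and the per-node list of directories come from the text.

```python
def subscription_sets(configured_dirs=None):
    """Return the subscription sets a node requests during anti-entropy.

    With no configured directories the node subscribes to everything ("/*"),
    as in the original design; otherwise it subscribes per listed directory.
    """
    if not configured_dirs:
        return ["/*"]
    return [d.rstrip("/") + "/*" for d in configured_dirs]


def matches(subscription_set, object_id):
    """True if object_id falls under a subscription set of the form '<prefix>/*'."""
    prefix = subscription_set[:-1]  # drop the trailing "*", keep the trailing "/"
    return object_id.startswith(prefix)


def deliverable(updates, sets):
    """Keep only the updates covered by at least one subscription set."""
    return [u for u in updates if any(matches(s, u) for s in sets)]


full = subscription_sets()
small = subscription_sets(["/photos", "/mail/"])
updates = ["/photos/a", "/music/b", "/mail/c"]
```

Here `deliverable(updates, full)` keeps all three updates, while `deliverable(updates, small)` keeps only the two under the listed directories.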

This change raises an issue for the designer. If a small
device C synchronizes with a first complete server S1, it
will not receive updates to objects outside of its
subscription sets. These omissions will not affect C since C will
not access those objects. However, if C later
synchronizes with a second complete server S2, S2 may end up
with causal gaps in its update logs due to the missing
updates that C doesn’t subscribe to. The designer has three
choices: weaken consistency from causal to per-object
coherence; restrict communication to avoid such
situations (e.g., prevent C from synchronizing with S2); or
weaken availability by forcing S2 to fill its gaps by
talking to another server before allowing local reads of
potentially stale objects. We choose the first, so we change
the blocking predicate for reads to no longer require the
isComplete condition. Other designers may make
different choices depending on their environment and goals.

Fig. 8: Average read latency of P-Coda and P-Coda with
cooperative caching.
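The effect of dropping a condition from a blocking predicate can be sketched as a conjunction over named conditions. Only isComplete is named in the text; the `is_valid` condition, the tuple representation, and the `can_read` helper are hypothetical and stand in for whatever conditions the actual policy uses.

```python
def can_read(object_state, required_conditions):
    """A read proceeds only when every required condition holds for the object."""
    return all(object_state.get(cond, False) for cond in required_conditions)


# Hypothetical predicates: the causal variant also requires is_complete.
CAUSAL_READ = ("is_valid", "is_complete")
PER_OBJECT_READ = ("is_valid",)

# An object whose body is valid but whose update log has a causal gap.
state_with_gap = {"is_valid": True, "is_complete": False}

blocked = not can_read(state_with_gap, CAUSAL_READ)
allowed = can_read(state_with_gap, PER_OBJECT_READ)
```

Under the weaker predicate the read is served locally despite the gap, which is the availability gained in exchange for giving up causal consistency.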

Figure 7 examines the bandwidth consumed to syn- 
chronize 3KB files in P-Bayou and serves two purposes. 
First, it demonstrates that the overhead for anti-entropy
in P-Bayou is relatively small even for small files
compared to an ideal Bayou implementation (plotted by
counting the bytes of data that must be sent ignoring all
metadata overheads). More importantly, it demonstrates
that if a node requires only a fraction (e.g., 10%) of the 
data, the small device enhancement, which allows a node 
to synchronize a subset of data, greatly reduces the band- 
width required for anti-entropy. 


P-Coda and cooperative caching. In P-Coda, on a 
read miss, a client is restricted to retrieving data from the 
server. We add cooperative caching to P-Coda by adding 
13 rules: 9 to monitor the reachability of nearby nodes,
2 to retrieve data from a nearby client on a read miss, and 
2 to fall back to the server if the client cannot satisfy the 
data request. 

Figure 8 shows the difference in read latency for
misses on a 1KB file with and without support for
cooperative caching. For the experiment, the round-trip
latency between the two clients is 10ms, whereas the
round-trip latency between a client and server is almost
500ms. When data can be retrieved from a nearby client,
read performance is greatly improved. More importantly,
with this new capability, clients can share data even when
disconnected from the server.
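The read path that the added rules implement can be sketched as a simple fallback chain. This is an illustrative Python model, not the R/OverLog rules themselves; representing the local cache, nearby clients, and server as dictionaries, and assuming that every listed peer is reachable, are simplifications made for this example.

```python
def read(object_id, local_cache, nearby_clients, server):
    """Return (data, source), trying the local cache, then nearby clients, then the server."""
    if object_id in local_cache:
        return local_cache[object_id], "local"
    for peer in nearby_clients:  # peers are assumed to be reachable
        if object_id in peer:
            local_cache[object_id] = peer[object_id]
            return peer[object_id], "peer"
    data = server[object_id]  # fall back to the server
    local_cache[object_id] = data
    return data, "server"


server = {"/f": b"body-f", "/g": b"body-g"}
peer_cache = {"/f": b"body-f"}
my_cache = {}

from_peer = read("/f", my_cache, [peer_cache], server)
from_server = read("/g", my_cache, [peer_cache], server)
from_local = read("/f", my_cache, [peer_cache], server)
```

With the round-trip latencies used in the experiment, a miss served by a peer costs roughly one 10ms round trip instead of one round trip of almost 500ms to the server.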


6.1.3 Ease of development


Each of these systems took one or two graduate students,
working part time, from a few days to three weeks to
construct. The time includes mapping the original system
design to PADS policy primitives, implementation, test- 
ing, and debugging. Mapping the design of the original 
implementation to routing and blocking policy was chal- 
lenging at first but became progressively easier. Once the 
design work was done, the implementation did not take 
long. 

Note that routing rules and blocking conditions are
extremely simple, low-level building blocks. Each
routing rule specifies the conditions under which a single
tuple should be produced. R/OverLog lets us specify
routing rules succinctly: across all of our systems, each
routing rule is from 1 to 3 lines of text. The count of
blocking conditions exposes the complexity of the
blocking predicates: each blocking predicate is an equation
across zero or more blocking condition elements from
Figure 6, so the count of at most 10 blocking
conditions for a policy indicates that across all of that policy’s
blocking predicates, a total of at most 10 conditions were
used. As Figure 1 indicates, each system was implemented
in fewer than 100 routing rules and fewer than 10 blocking
conditions.


6.1.4 Debugging and correctness 


Three aspects of PADS help simplify debugging and rea- 
soning about the correctness of PADS systems. 

First, the conciseness of PADS policy greatly facili- 
tates analysis, peer review, and refinement of design. It 
was extremely useful to be able to sit down and walk 
through an entire design in a one- or two-hour meeting.

Second, the abstractions themselves divide work in a 
way that simplifies reasoning about correctness. For ex- 
ample, we find that the separation of policy into routing 
and blocking helps reduce the risk of consistency bugs. 
A system’s consistency and durability requirements are 
specified and enforced by simple blocking predicates, so 
it is not difficult to get them right. We must then design 
our routing policy to deliver sufficient data to a node to 
eventually satisfy the predicates and ensure liveness. 

Third, domain-specific languages can facilitate the 
use of model checking [4]. As future work, we intend 
to implement a translator from R/OverLog to Promela [1]
so that policies can be model checked to test the correct- 
ness of a system’s implementation. 


6.2 Realism 


When building a distributed storage system, a system
designer needs to address issues that arise in practical
deployments such as configuration options, local crash
recovery, distributed crash recovery, and maintaining
consistency and durability despite crashes and network
failures. PADS makes it easy to tackle these issues for three
reasons.

First, since the stored events primitive allows routing 
policies to access local objects, policies can store and 
retrieve configuration and routing options on-the-fly. For 
example, in P-TierStore, a node stores in a configuration
object the publications it wishes to access. In P-Pangaea, 
the parent directory object of each object stores the list 
of nodes from which to fetch the object on a read miss. 

Second, for consistency and crash recovery, the un- 
derlying subscription mechanisms insulate the designer 
from low-level details. Upon recovery, local mechanisms 
first reconstruct local state from persistent logs. Also, 
PADS’s subscription primitives abstract away many chal- 
lenging details of resynchronizing node state. Notably, 
these mechanisms track consistency state even across 
crashes that could introduce gaps in the sequences of
invalidations sent between nodes. As a result, crash
recovery in most systems simply entails restoring lost
subscriptions and letting the underlying mechanisms ensure
that the local state reflects any updates that were missed.

Third, blocking predicates greatly simplify maintain- 
ing consistency during crashes. If there is a crash and 
the required consistency semantics cannot be guaranteed, 
the system will simply block access to “unsafe” data. On 
recovery, once the subscriptions have been restored and 
the predicates are satisfied, the data become accessible 
again. 

In each of the PADS systems we constructed, we im- 
plemented support for these practical concerns. Due 
to space limitations, we focus this discussion on the
behavior of two systems under failure: the full-featured
client-server system (P-FCS) and TierStore
(P-TierStore). Both are client-server based systems, but they
have very different consistency guarantees. We demon- 
strate the systems are able to provide their corresponding 
consistency guarantees despite failures. 


Consistency, durability, and crash recovery in P-FCS
and P-TierStore. Our experiment uses one server and
two clients. To highlight the interactions, we add a 50ms
delay on the network links between the clients and the
server. Client C1 repeatedly reads an object and then
sleeps for 500ms, and Client C2 repeatedly writes
increasing values to the object and sleeps for 2000ms. We
plot the start time, finish time, and value of each opera- 
tion. 

Figure 9 illustrates the behavior of P-FCS under failures.
P-FCS guarantees sequential consistency by maintaining
per-object callbacks [11], maintaining object leases [8],
and blocking the completion of a write until the server
has stored the write and invalidated all other client
caches. We configure the system with a 10 second lease
timeout. During the first 20 seconds of the experiment, as
the figure indicates, sequential consistency is enforced.
We kill (kill -9) the server process 20 seconds into the
experiment and restart it 10 seconds later. While the
server is down, writes block immediately but reads
continue until the lease expires, after which reads block as
well. When we restart the server, it recovers its local
state and then resumes processing requests. Both reads
and writes resume shortly after the server restarts, and the
subscription reestablishment and blocking policy ensure
that consistency is maintained.

Fig. 9: Demonstration of full client-server system, P-FCS,
under failures. The x axis shows time and the y axis shows the …

Fig. 10: Demonstration of TierStore under a workload similar
to that in Figure 9.

We kill the reader, C1, at 50 seconds and restart it 15
seconds later. Initially, writes block, but as soon as the 
lease expires, writes proceed. When the reader restarts, 
reads resume as well. 
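The blocking behavior in these two failure scenarios can be summarized with a toy model. This is only a sketch of the observable behavior described above, not the P-FCS policy: the function names are hypothetical, and the model assumes the relevant lease was last renewed at a known time and ignores the 50ms link delay and recovery time.

```python
LEASE_TIMEOUT_S = 10  # lease timeout configured in the experiment


def reads_allowed(now_s, server_up, last_renewal_s):
    """While the server is down, a client keeps reading until its lease expires."""
    if server_up:
        return True
    return now_s - last_renewal_s < LEASE_TIMEOUT_S


def writes_allowed(now_s, server_up, reader_up, reader_last_renewal_s):
    """A write needs the server, and needs the other client invalidated or its lease expired."""
    if not server_up:
        return False
    if reader_up:
        return True
    return now_s - reader_last_renewal_s >= LEASE_TIMEOUT_S


# Server killed at t=20 s (assume the reader's lease was renewed at t=20 s).
read_at_25 = reads_allowed(25, server_up=False, last_renewal_s=20)   # True
read_at_30 = reads_allowed(30, server_up=False, last_renewal_s=20)   # False
write_at_25 = writes_allowed(25, False, True, 20)                    # False

# Reader killed at t=50 s (assume its lease was renewed at t=50 s).
write_at_55 = writes_allowed(55, True, False, 50)                    # False
write_at_60 = writes_allowed(60, True, False, 50)                    # True
```

The model makes the asymmetry explicit: a server failure blocks writes at once and reads only after the lease expires, while a reader failure delays writes by at most one lease timeout.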

Figure 10 illustrates a similar scenario using
P-TierStore. P-TierStore enforces monotonic reads
coherence rather than sequential consistency, and it propagates
updates via subscriptions when the network is available.
As a result, all reads and writes complete locally and
without blocking despite failures. During periods of no
failures, the reader receives updates quickly and reads
return recent values. However, if the server is unavailable,
writes still progress, and the reads return values that are
locally stored even if they are stale.

Fig. 11: Network overheads of primitives. Here, N_Nodes is the
number of nodes. N_SSObj is the number of objects in the
subscription set. N_SSPrevUpdates and N_SSObjUpd are the number of
updates that occurred and the number of objects in the
subscription set that were modified from a subscription start time to the
current logical time. N_SSNewUpdates is the number of updates to
the subscription set that occur after the subscription has caught
up to the sender’s logical time.


6.3 Performance 


The programming model exposed to designers must have 
predictable costs. In particular, the volume of data stored 
and sent over the network should be proportional to the 
amount of information a node is interested in. 

We carry out the performance evaluation of PADS in two
steps. First, we evaluate the fundamental costs associ- 
ated with the PADS architecture. In particular, we ar- 
gue that network overheads of PADS are within reason- 
able bounds of ideal implementations and highlight when 
they depart from ideal. 

Second, we evaluate the absolute performance of the 
PADS prototype. We quantify overheads associated with 
the primitives via micro-benchmarks and compare the
performance of two implementations of the same
system: the original implementation and the one built over
PADS. We find that P-Coda is as much as 3.3 times worse
than Coda.


6.3.1 Fundamental overheads and scalability 


Figure 11 shows the network cost associated with our
prototype’s implementation of PADS’s primitives and
indicates that our costs are close to the ideal of having
actual costs be proportional to the amount of new
information transferred between nodes. Note that these ideal
costs may not always be achievable.

There are two ways that PADS sends extra informa- 
tion. 

First, during invalidation subscription setup in PADS,
the sender transmits a version vector indicating the start
time of the subscription and catch-up information so that
the receiver can determine if the catch-up information
introduces gaps in the receiver’s consistency state. That
cost is then amortized over all the updates sent on the
connection. Also, this cost can be avoided by starting a
subscription at logical time 0 with a checkpoint rather
than a log for catching up to the current time. Note that
checkpoint catch-up is particularly cheap when interest
sets are small.

Fig. 12: Network bandwidth cost to synchronize 1000 10KB
files, 100 of which are modified.

Second, in order to support flexible consistency,
invalidation subscriptions also carry extra information such as
imprecise invalidations [2]. Imprecise invalidations
summarize updates to objects outside the subscription set and
are sent to mark logical gaps in the causal stream of
invalidations. The number of imprecise invalidations sent
depends on the workload and is never more than the
number of invalidations sent for updates to objects in the
subscription set. The size of imprecise invalidations depends
on the locality of the workload and how compactly the
invalidations compress into imprecise invalidations.
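The way runs of out-of-set updates collapse into imprecise invalidations can be illustrated with a short sketch. This is a simplified model under stated assumptions, not the PRACTI wire format: it uses scalar logical times instead of version vectors, the message tuples and function names are hypothetical, and a trailing run of out-of-set updates is held back until the next in-set update arrives.

```python
def invalidation_stream(updates, subscribed):
    """Convert a causally ordered list of (logical_time, object_id) updates
    into the messages sent on one invalidation subscription."""
    messages = []
    gap = None  # pending [start_time, end_time, objects] outside the subscription set
    for time, obj in updates:
        if obj in subscribed:
            if gap is not None:
                # Summarize the whole run of out-of-set updates in one message.
                messages.append(("imprecise", gap[0], gap[1], frozenset(gap[2])))
                gap = None
            messages.append(("precise", time, obj))
        elif gap is None:
            gap = [time, time, {obj}]
        else:
            gap[1] = time
            gap[2].add(obj)
    return messages


def count(messages, kind):
    """Number of messages of the given kind ("precise" or "imprecise")."""
    return sum(1 for m in messages if m[0] == kind)


subscribed = {"a1", "a2"}
interleaved = [(1, "b1"), (2, "a1"), (3, "b2"), (4, "a2")]
with_locality = [(1, "b1"), (2, "b2"), (3, "a1"), (4, "a2")]

random_msgs = invalidation_stream(interleaved, subscribed)
local_msgs = invalidation_stream(with_locality, subscribed)
```

In this model the interleaved workload produces two imprecise invalidations and the workload with locality produces one, and in both cases the imprecise count does not exceed the precise count because each imprecise message is emitted only immediately before a precise one.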

Overall, we expect PADS to scale well to systems with 
large numbers of objects or nodes—subscription sets and 
imprecise invalidations ensure that the number of records 
transferred is proportional to the amount of data of interest
(and not to the overall size of the database), and the per- 
node overheads associated with the version vectors used 
to set up some subscriptions can be amortized over all of 
the updates sent. 


6.3.2 Quantifying the constants 


We run experiments to investigate the constant factors 
in the cost model and quantify the overheads associated 
with subscription setup and flexible consistency. Fig- 
ure 12 illustrates the synchronization cost for a simple 
scenario. In this experiment, there are 10,000 objects 
in the system organized into 10 groups of 1,000 objects 
each, and each object’s size is 10KB. The reader registers
to receive invalidations for one of these groups. Then, the 
writer updates 100 of the objects in each group. Finally, 
the reader reads all the objects. 

We look at four scenarios representing combinations
of coarse-grained vs. fine-grained synchronization and
of writes with locality vs. random writes. For
coarse-grained synchronization, the reader creates a single
invalidation subscription and a single body subscription
spanning all 1000 objects in the group of interest and receives
100 updated objects. For fine-grained synchronization,
the reader creates 1000 invalidation subscriptions, each
for one object, and fetches each of the 100 updated
bodies. For writes with locality, the writer updates 100
objects in the ith group before updating any in the (i+1)st
group. For random writes, the writer intermixes writes
to different groups in a random order.

Fig. 13: Read and write latencies in milliseconds for Coda and
P-Coda, reported for 1KB objects and 100KB objects. The numbers
in parentheses indicate factors of overhead. The values are
averages of 5 runs.

Four things should be noted. First, the synchronization
overheads are small compared to the body data
transferred. Second, the “extra” overheads associated with
PADS subscription setup and flexible consistency over
the best case are a small fraction of the total overhead
in all cases. Third, when writes have locality, the
overhead of flexible consistency drops further because larger
numbers of invalidations are combined into an
imprecise invalidation. Fourth, coarse-grained synchronization
has lower overhead than fine-grained synchronization
because it avoids per-object subscription setup costs.

Similarly, Figure 7 compares the bandwidth overhead 
associated with using a PADS system implementation 
with an ideal implementation. As the figure indicates, the 
bandwidth to propagate updates is close to ideal imple- 
mentations. The extra overhead is due to the meta-data 
sent with each update. 


6.3.3 Absolute Performance 


Our goal is to provide sufficient performance to be use- 
ful. We compare the performance of a hand-crafted im- 
plementation of a system (Coda) that has been in produc- 
tion use for over a decade and a PADS implementation of 
the same system (P-Coda). We expect to pay some
overheads for three reasons. First, PADS is a relatively
untuned prototype rather than well-tuned production code.
Second, our implementation emphasizes portability and 
simplicity, so PADS is written in Java and stores data 
using BerkeleyDB rather than running on bare metal. 
Third, PADS provides additional functionality such as 
tracking consistency metadata, some of which may not 
be required by a particular hand-crafted system. 

Figure 13 compares the client-side read and write
latencies under Coda and P-Coda. The systems are set up
in a two-client configuration. To measure the read
latencies, Client C1 has a collection of 1,000 objects and
Client C2 has none. For cold reads, Client C2 randomly
selects 100 objects to read. Each read fetches the object
from the server and establishes a callback for the object.
C2 re-reads those objects to measure the hot-read latency.
To measure the connected write latency, both C1 and C2
initially store the same collection of 1,000 objects. C2
selects 100 objects to write. The write will cause the
server to store the update and break a callback with C1
before the write completes at C2. Disconnected writes
are measured by disconnecting C2 from the server and
writing to 100 randomly selected objects.

The performance of PADS’s implementation is
comparable to the hand-crafted C implementation in most cases
and is at most 3 times worse in the worst case we
measured.


7 Related work 


PADS and PRACTI. We use a modified version of
PRACTI [2, 35] as the data plane for PADS. Writing a
new policy in PADS differs from constructing a system
using PRACTI alone for three reasons.

1. PADS adds key abstractions not present in PRACTI 
such as the separation of routing policy from blocking 
policy, stored events, and commit actions. 


2. PADS significantly changes abstractions from those
provided in PRACTI. We distilled the interface
between mechanism and policy to the handful of calls
in Figures 3, 4, and 5, and we changed the
underlying protocols and mechanisms to meet the needs of
the data plane required by PADS. For example, where
the original PRACTI protocol provides the abstraction
of connections between nodes, each of which carries
one subscription, PADS provides the more lightweight
abstraction of subscriptions, which forced us to
redesign the protocol to multiplex subscriptions onto
a single connection between a pair of nodes in
order to efficiently support fine-grained subscriptions
and dynamic addition of new items to a
subscription. Similarly, where PRACTI provides the
abstraction of bound invalidations to make sure that bodies
and updates propagate together, PADS provides the
more flexible blocking predicates, and where PRACTI
hard-coded several mechanisms to track the progress
of updates through the system, PADS simply triggers
the routing policy and lets the routing policy handle
whatever notifications are needed.


3. PADS provides R/OverLog, which has proven to be a
convenient way to design, write, and debug
routing policies.

The whole is more important than the parts. Building
systems with PADS is much simpler than without. In
some cases this is because PADS provides abstractions
not present in PRACTI. In others, it is “merely” because
PADS provides a better way of thinking about the
problem.


R/OverLog and OverLog. R/OverLog extends
OverLog [17] by (1) adding type information to events, (2)
providing an interface to pass triggers, actions, and
stored events as tuples between PADS and the R/OverLog
program, and (3) restricting the syntax slightly to allow
us to implement an R/OverLog-to-Java compiler that
produces executables that are more stable and faster than
programs under the more general P2 [17] runtime
system.


Other frameworks. A number of other efforts have 
defined frameworks for constructing distributed storage 
systems for different environments. Deceit [29] focuses 
on distributed storage across a well-connected cluster of 
servers. Stackable file systems [10] seek to provide a
way to add features and compose file systems, but they
focus on adding features to local file systems.

Some systems, such as Cimbiosys [24], distribute 
data among nodes not based on object identifiers or file 
names, but rather on content-based filters. We see no 
fundamental barriers to incorporating filters in PADS to 
identify sets of related objects. This would allow sys- 
tem designers to set up subscriptions and maintain con- 
sistency state in terms of filters rather than object-name 
prefixes. 

PADS follows in the footsteps of efforts to define run- 
time systems or domain-specific languages to ease the 
construction of routing [17], overlay [25], cache consis- 
tency protocols [4], and routers [15]. 


8 Conclusion


Our goal is to allow developers to quickly build new dis- 
tributed storage systems. This paper presents PADS, a 
policy architecture that allows developers to construct 
systems by specifying policy without worrying about 
complex low-level implementation details. Our
experience has led us to draw two conclusions. First, the
approach of constructing a system in terms of a routing
policy and a blocking policy over a data plane greatly
reduces development time. Second, the range of systems
implemented with the small number of primitives
exposed by the API suggests that the primitives adequately
capture the key abstractions for building distributed
storage systems.
Abstract


This paper presents Sora, a fully programmable soft- 
ware radio platform on commodity PC architectures. 
Sora combines the performance and fidelity of hardware 
SDR platforms with the programmability and flexibil- 
ity of general-purpose processor (GPP) SDR platforms. 
Sora uses both hardware and software techniques to ad- 
dress the challenges of using PC architectures for high- 
speed SDR. The Sora hardware components consist of 
a radio front-end for reception and transmission, and 
a radio control board for high-throughput, low-latency 
data transfer between radio and host memories. Sora 
makes extensive use of features of contemporary proces- 
sor architectures to accelerate wireless protocol process- 
ing and satisfy protocol timing requirements, including 
using dedicated CPU cores, large low-latency caches to 
store lookup tables, and SIMD processor extensions for 
highly efficient physical layer processing on GPPs. Us- 
ing the Sora platform, we have developed a demonstra- 
tion radio system called SoftWiFi. SoftWiFi seamlessly 
interoperates with commercial 802.11a/b/g NICs, and
achieves performance equivalent to that of commercial NICs
at each modulation.


1 Introduction 


Software defined radio (SDR) holds the promise of fully 
programmable wireless communication systems, effec- 
tively supplanting current technologies which have the 
lowest communication layers implemented primarily in 
fixed, custom hardware circuits. Realizing the promise 
of SDR in practice, however, has presented developers 
with a dilemma. 

Many current SDR platforms are based on either
programmable hardware such as field programmable gate
arrays (FPGAs) [6, 11] or embedded digital signal
processors (DSPs) [5, 13]. Such hardware platforms can
meet the processing and timing requirements of
modern high-speed wireless protocols, but programming
FPGAs and specialized DSPs are difficult tasks.
Developers have to learn how to program to each particular
embedded architecture, often without the support of a rich
development environment of programming and
debugging tools. Hardware platforms can also be expensive;
the WARP [6] educational price, for example, is over
US$9,750.

This work was performed when Ji Fang, He Liu, Yusheng Ye,
and Shen Wang were visiting students and Geoffrey M. Voelker was a
visiting researcher at Microsoft Research Asia.


In contrast, SDR platforms based on general-purpose 
processor (GPP) architectures, such as commodity PCs, 
have the opposite set of tradeoffs. Developers pro- 
gram to a familiar architecture and environment using 
sophisticated tools, and radio front-end boards for in- 
terfacing with a PC are relatively inexpensive. How- 
ever, since PC hardware and software have not been 
designed for wireless signal processing, existing GPP- 
based SDR platforms can achieve only limited perfor- 
mance [1,22]. For example, the popular GNU Radio 
platform [1] achieves only a few Kbps throughput on an 
8MHz channel [21], whereas modern high-speed wire- 
less protocols like 802.11 support multiple Mbps data 
rates on a much wider 20MHz channel [7]. These con- 
straints prevent developers from using such platforms to 
achieve the full fidelity of state-of-the-art wireless pro- 
tocols while using standard operating systems and appli- 
cations in a real environment. 


In this paper we present Sora, a fully programmable 
software radio platform that provides the benefits of both 
SDR approaches, thereby resolving the SDR platform 
dilemma for developers. With Sora, developers can implement 
and experiment with high-speed wireless protocol 
stacks, e.g., IEEE 802.11a/b/g, using commodity 
general-purpose PCs. Developers program in familiar 
programming environments with powerful tools on stan- 
dard operating systems. Software radios implemented 
on Sora appear like any other network device, and users 
can run unmodified applications on their software ra- 
dios with the same performance as commodity hardware 
wireless devices. 


An implementation of high-speed wireless protocols 
on general-purpose PC architectures must overcome a 
number of challenges that stem from existing hardware 
interfaces and software architectures. First, transferring 
high-fidelity digital waveform samples into PC memory 
for processing requires very high bus throughput. Ex- 
isting GPP platforms like GNU Radio use USB 2.0 or 
Gigabit Ethernet [1], which cannot satisfy this requirement 
for high-speed wireless protocols. Second, physical 
layer (PHY) signal processing has very high computational 
requirements for generating information bits 
from waveforms, and vice versa, particularly at high 
modulation rates; indeed, back-of-the-envelope calcu- 
lations for processing requirements on GPPs have in- 
stead motivated specialized hardware approaches in the 
past [17, 19]. Lastly, wireless PHY and media ac- 
cess control (MAC) protocols have low-latency real- 
time deadlines that must be met for correct operation. 
For example, the 802.11 MAC protocol requires precise 
timing control and ACK response latency on the order of 
tens of microseconds. Existing software architectures on 
the PC cannot consistently meet this timing requirement. 

Sora uses both hardware and software techniques to 
address the challenges of using PC architectures for 
high-speed SDR. First, we have developed a new, in- 
expensive radio control board (RCB) with a radio front- 
end for transmission and reception. The RCB bridges 
an RF front-end with PC memory over the high-speed 
and low-latency PCIe bus [8]. With this bus standard, 
the RCB can support 16.7Gbps (x8 mode) throughput 
with sub-microsecond latency, which together satisfy 
the throughput and timing requirements of modern wireless 
protocols while performing all digital signal processing 
on the host CPU and memory. 

Second, to meet PHY processing requirements, Sora 
makes full use of various features of widely adopted 
multi-core architectures in existing GPPs. The Sora 
software architecture also explicitly supports stream- 
lined processing that enables components of the signal 
processing pipeline to efficiently span multiple cores. 
Further, we change the conventional implementation 
of PHY components to extensively take advantage of 
lookup tables (LUTs), trading off computation for mem- 
ory. These LUTs substantially reduce the computational 
requirements of PHY processing, while at the same time 
taking advantage of the large, low-latency caches on 
modern GPPs. Finally, Sora uses the SIMD (Single In- 
struction Multiple Data) extensions in existing proces- 
sors to further accelerate PHY processing. With these 
optimizations, Sora can fully support the complete dig- 
ital processing of 802.11b modulation rates on just one 
core, and 802.11a/g on two cores. 

Lastly, to meet the real-time requirements of high- 
speed wireless protocols, Sora provides a new kernel ser- 
vice, core dedication, which allocates processor cores 
exclusively for real-time SDR tasks. We demonstrate 
that it is a simple yet crucial abstraction that guarantees 
the computational resources and precise timing control 
necessary for SDR on a GPP. 

We have developed a demonstration radio system, 
SoftWiFi, based on the Sora platform. SoftWiFi currently 
supports the full suite of 802.11a/b/g modulation 
rates, seamlessly interoperates with commercial 802.11 
NICs, and achieves performance equivalent to that of 
commercial NICs at each modulation. 

In summary, the contributions of this paper are: (1) 
the design and implementation of the Sora platform and 
its high-performance PHY processing library; (2) the de- 
sign and implementation of the SoftWiFi radio system 
that can interoperate with commercial wireless NICs using 
802.11a/b/g standards; and (3) the evaluation of Sora 
and SoftWiFi on a commodity multi-core PC. To the best 
of our knowledge, Sora is the first SDR platform that 
enables users to develop high-speed wireless implementations, 
such as the IEEE 802.11a/b/g PHY and MAC, 
entirely in software on a standard PC architecture. 

The rest of the paper is organized as follows. Sec- 
tion 2 provides background on wireless communication 
systems. We then present the Sora architecture in Sec- 
tion 3, and we discuss our approach for addressing the 
challenges of building an SDR platform on a GPP sys- 
tem in Section 4. We then describe the implementation 
of the Sora platform in Section 5. Section 6 presents 
the design and implementation of SoftWiFi, a fully func- 
tional software WiFi radio based on Sora, and we eval- 
uate its performance in Section 7. Finally, Section 9 de- 
scribes related work and Section 10 concludes. 


2 Background and Requirements 


In this section, we briefly review the physical layer 
(PHY) and media access control (MAC) components of 
typical wireless communication systems. Although different 
wireless technologies may differ subtly from 
one another, they generally follow similar designs 
and share many common algorithms. We use the 
IEEE 802.11a/b/g standards to exemplify the 
characteristics of wireless PHY and MAC components 
as well as the challenges of implementing them in 
software. 


2.1 Wireless PHY 


The role of the PHY layer is to convert information bits 
into a radio waveform, or vice versa. At the transmitter 
side, the wireless PHY component first modulates the 
message (i.e., a packet or a MAC frame) into a time se- 
quence of baseband signals. Baseband signals are then 
passed to the radio front-end, where they are multiplied 
by a high frequency carrier and transmitted into the 
wireless channel. At the receiver side, the radio front- 
end detects signals in the channel and extracts the base- 
band signal by removing the high-frequency carrier. The 
extracted baseband signal is then fed into the receiver’s 
PHY layer to be demodulated into the original message. 

Advanced communication systems (e.g., IEEE 
802.11a/b/g, as shown in Figure 1) contain multiple 
functional blocks in their PHY components. These 
functional blocks are pipelined with one another. Data 
are streamed through these blocks sequentially, but with 
different data types and sizes. As illustrated in Figure 1, 
different blocks may consume or produce different types 
of data at different rates, arranged in small data blocks. 
For example, in 802.11b, the scrambler may consume 
and produce one bit, while DQPSK modulation maps 
each two-bit data block onto a complex symbol which 
uses two 16-bit numbers to represent the in-phase and 
quadrature (I/Q) components. 

Figure 1: PHY operations of IEEE 802.11a/b/g transceiver. 
Panel (b): IEEE 802.11a/g, 24Mbps. 

Each PHY block performs a fixed amount of computation 
on every transmitted or received bit. When the 
data rate is high, e.g., 11Mbps for 802.11b and 54Mbps 
for 802.11a/g, PHY processing blocks consume a significant 
amount of computational power. Based on the 
model in [19], we estimate that a direct implementation 
of 802.11b may require 10Gops while 802.11a/g needs 
at least 40Gops. These requirements are very demanding 
for software processing in GPPs. 

PHY processing blocks directly operate on the digital 
waveforms after modulation on the transmitter side 
and before demodulation on the receiver side. Therefore, 
high-throughput interfaces are needed to connect these 
processing blocks as well as to connect the PHY and 
radio front-end. The required throughput scales linearly 
with the bandwidth of the baseband signal. For example, 
the channel bandwidth is 20MHz in 802.11a, which requires 
a data rate of at least 20M complex samples per second 
to represent the waveform [14]. These complex samples 
normally require 16-bit quantization for both I and Q 
components to provide sufficient fidelity, translating into 
32 bits per sample, or 640Mbps for the full 20MHz channel. 
Over-sampling, a technique widely used for better 
performance [12], doubles the requirement to 1.28Gbps 
to move data between the RF front-end and PHY blocks 
for one 802.11a channel. 
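
The arithmetic behind these figures can be checked with a short sketch. The function name and parameters below are illustrative assumptions, not part of Sora; the sketch only restates the calculation in the text (one complex sample per Hz of bandwidth, 16 bits each for I and Q, optional 2x over-sampling).

```python
def required_throughput_bps(bandwidth_hz, bits_per_component=16, oversampling=1):
    """Bits per second needed to carry the baseband waveform (hypothetical helper)."""
    samples_per_sec = bandwidth_hz * oversampling  # complex samples per second
    bits_per_sample = 2 * bits_per_component       # one I and one Q component
    return samples_per_sec * bits_per_sample


# 20MHz 802.11a channel, without and with 2x over-sampling.
base = required_throughput_bps(20_000_000)
oversampled = required_throughput_bps(20_000_000, oversampling=2)
print(base / 1e6, "Mbps")
print(oversampled / 1e9, "Gbps")
```

Because the requirement grows linearly with bandwidth, wider channels or multiple MIMO streams multiply it accordingly.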


2.2 Wireless MAC 


The wireless channel is a resource shared by all 
transceivers operating on the same spectrum. As simultaneously 
transmitting neighbors may interfere with 
each other, various MAC protocols have been developed 
to coordinate their transmissions in wireless networks to 
avoid collisions. 

Most modern MAC protocols, such as 802.11, require 
timely responses to critical events. For example, 802.11 
adopts a CSMA (Carrier-Sense Multiple Access) MAC 
protocol to coordinate transmissions [7]. Transmitters 
are required to sense the channel before starting their 
transmission, and channel access is only allowed when 
no energy is sensed, i.e., the channel is free. The latency 
between sense and access should be as small as possible. 
Otherwise, the sensing result could be outdated and inac- 
curate. Another example is the link-layer retransmission 
mechanisms in wireless protocols, which may require an 
immediate acknowledgement (ACK) to be returned in a 
limited time window. 

Commercial standards like IEEE 802.11 mandate a 
response latency within tens of microseconds, which is 
challenging to achieve in software on a general purpose 
PC with a general purpose OS. 


2.3 Software Radio Requirements 


Given the above discussion, we summarize the require- 
ments for implementing a software radio system on a 
general PC platform: 


High system throughput. The interfaces between the 
radio front-end and PHY as well as between some 
PHY processing blocks must possess sufficiently high 
throughput to transfer high-fidelity digital waveforms. 
To support a 20MHz channel for 802.11, the interfaces 
must sustain at least 1.28Gbps. Conventional interfaces 
like USB 2.0 (< 480Mbps) or Gigabit Ethernet 
(< 1Gbps) cannot meet this requirement [1]. 

Figure 2: Sora system architecture. All PHY and MAC 
execute in software on a commodity multi-core CPU. 


Intensive computation. High-speed wireless protocols 
require substantial computational power for their PHY 
processing. Such computational requirements also in- 
crease proportionally with communication speed. Un- 
fortunately, techniques used in conventional PHY hard- 
ware or embedded DSPs do not directly carry over to 
GPP architectures. Thus, we require new software tech- 
niques to accelerate high-speed signal processing on 
GPPs. With the advent of many-core GPP architec- 
tures [9], it is now reasonable to dedicate computational 
power solely to signal processing. But, it is still chal- 
lenging to build a software architecture to efficiently ex- 
ploit the full capability of multiple cores. 


Real-time enforcement. Wireless protocols have multiple 
real-time deadlines that need to be met. Consequently, 
not only is processing throughput a critical requirement, 
but the processing latency also needs to meet response 
deadlines. Some MAC protocols also require 
precise timing control at the granularity of microseconds 
to ensure certain actions occur at exactly pre-scheduled 
time points. Meeting such real-time deadlines on a general 
PC architecture is a non-trivial challenge: time-sharing 
operating systems may not respond to an event in a 
timely manner, and bus interfaces, such as Gigabit Ethernet, 
could introduce indefinite delays of far more than a 
few µs. Therefore, meeting these real-time requirements 
requires new mechanisms on GPPs. 


3 Architecture 


We have developed a high-performance software radio 
platform called Sora that addresses these challenges. It 
is based on a commodity general-purpose PC architec- 
ture. For flexibility and programmability, we push as 
much communication functionality as possible into soft- 
ware, while keeping hardware additions as simple and 
generic as possible. Figure 2 illustrates the overall sys- 
tem architecture. 
3.1 Hardware Components 


The hardware components in the Sora architecture are 
a new radio control board (RCB) with an interchange- 
able radio front-end (RF front-end). The radio front- 
end is a hardware module that receives and/or trans- 
mits radio signals through an antenna. In the Sora ar- 
chitecture, the RF front-end represents the well-defined 
interface between the digital and analog domains. It 
contains analog-to-digital (A/D) and digital-to-analog 
(D/A) converters, and necessary circuitry for radio trans- 
mission. During receiving, the RF front-end acquires 
an analog waveform from the antenna, possibly down- 
converts it to a lower frequency, and then digitizes it into 
discrete samples before transferring them to the RCB. 
During transmitting, the RF front-end accepts a syn- 
chronous stream of software-generated digital samples 
and synthesizes the corresponding analog waveform be- 
fore emitting it using the antenna. Since all signal pro- 
cessing is done in software, the RF front-end design 
can be rather generic. It can be implemented in a self- 
contained module with a standard interface to the RCB. 
Multiple wireless technologies defined on the same fre- 
quency band can use the same RF front-end hardware, 
and the RCB can connect to different RF front-ends de- 
signed for different frequency bands. 


The RCB is a new PC interface board for establish- 
ing a high-throughput, low-latency path for transfer- 
ring high-fidelity digital signals between the RF front- 
end and PC memory. To achieve the required system 
throughput discussed in Section 2.1, the RCB uses a 
high-speed, low-latency bus such as PCIe [8]. With a 
maximum throughput of 64Gbps (PCIe x32) and 
sub-microsecond latency, it is well-suited for supporting 
multiple gigabit data rates for wireless signals over a 
very wide band or over many MIMO channels. Fur- 
ther, the PCIe interface is now common in contemporary 
commodity PCs. 


Another important role of the RCB is to bridge the 
synchronous data transmission at the RF front-end and 
the asynchronous processing on the host CPU. The RCB 
uses various buffers and queues, together with a large 
on-board memory, to convert between synchronous and 
asynchronous streams and to smooth out bursty trans- 
fers between the RCB and host memory. The large 
on-board memory further allows caching pre-computed 
waveforms, adding additional flexibility for software ra- 
dio processing. 


Finally, the RCB provides a low-latency control path 
for software to control the RF front-end hardware and 
to ensure it is properly synchronized with the host CPU. 
Section 5.1 describes our implementation of the RCB in 
more detail. 
Figure 3: Software architecture of Sora soft-radio stack. 


3.2 Sora Software 


Figure 3 illustrates Sora’s software architecture. The 
software components in Sora provide necessary sys- 
tem services and programming support for implement- 
ing various wireless PHY and MAC protocols in a 
general-purpose operating system. In addition to fa- 
cilitating the interaction with the RCB, the Sora soft- 
radio stack provides a set of techniques to greatly im- 
prove the performance of PHY and MAC processing on 
GPPs. To meet the processing and real-time require- 
ments, these techniques make full use of various com- 
mon features in existing multi-core CPU architectures, 
including the extensive use of lookup tables (LUTs), 
substantial data-parallelism with CPU SIMD extensions, 
the efficient partitioning of streamlined processing over 
multiple cores, and exclusive dedication of cores for 
software radio tasks. 


4 High-Performance SDR Processing 


In this section we describe the software techniques used 
by Sora to achieve high-performance SDR processing. 


4.1 Efficient PHY processing 


In a memory-for-computation tradeoff, Sora relies upon 
the large-capacity, high-speed cache memory in GPPs to 
accelerate PHY processing with pre-calculated lookup 
tables (LUTs). Contemporary CPU architectures, 
such as Intel Core 2, usually have megabytes of 
L2 cache with a low (10-20 cycles) access latency. If 
we pre-calculate LUTs for a large portion of PHY algorithms, 
we can greatly reduce the computational requirement 
for on-line processing. 

For example, the soft demapper algorithm used in demodulation 
needs to calculate the confidence level of 
each bit contained in an incoming symbol. This task 
involves rather complex computation proportional to the 
modulation density. More precisely, it conducts an extensive 
search over all modulation points in a constellation 
graph and calculates a ratio between the minimum 
of Euclidean distances to all points representing one and 
the minimum of distances to all points representing zero. 
In this case, we can pre-calculate the confidence levels 
for all possible incoming symbols based on their I and 
Q values, and build LUTs to directly map the input symbol 
to a confidence level. Such LUTs are not large. For 
example, in 802.11a/g with a 54Mbps modulation rate 
(64-QAM), the size of the LUT for the soft demapper is 
only 1.5KB. 
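
The following is a minimal sketch of the pre-calculation idea, not Sora's demapper. It assumes a hypothetical Gray-labelled 4-level axis (one axis of a 16-QAM constellation), a 64-entry quantization of the input, and a confidence value computed as the difference of squared minimum distances rather than the ratio described above; all names (LEVELS, soft_bits, build_lut, lookup) are illustrative.

```python
# Hypothetical 4-level axis with Gray labels (bit0, bit1); for illustration only.
LEVELS = {-3: (0, 0), -1: (0, 1), 1: (1, 1), 3: (1, 0)}


def soft_bits(x):
    """Direct computation: search all points for each bit (the slow path)."""
    out = []
    for bit in range(2):
        d0 = min((x - lvl) ** 2 for lvl, lab in LEVELS.items() if lab[bit] == 0)
        d1 = min((x - lvl) ** 2 for lvl, lab in LEVELS.items() if lab[bit] == 1)
        out.append(d0 - d1)  # positive: bit is more likely 1
    return tuple(out)


def build_lut(lo=-4.0, hi=4.0, n=64):
    """Pre-calculate soft_bits for n quantized input values in [lo, hi]."""
    step = (hi - lo) / (n - 1)
    return [soft_bits(lo + i * step) for i in range(n)], lo, step


def lookup(lut, lo, step, x):
    """On-line path: quantize the input and read the table, no search."""
    idx = max(0, min(len(lut) - 1, round((x - lo) / step)))
    return lut[idx]


lut, lo, step = build_lut()
print(lookup(lut, lo, step, 3.0))
```

The on-line cost becomes one quantization and one table read per axis, which is the memory-for-computation trade the text describes.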

As we detail later in Section 5.2.1, more than half 
of the common PHY algorithms can indeed be rewrit- 
ten with LUTs, each with a speedup from 1.5x to 50x. 
Since the size of each LUT is sufficiently small, the sum 
of all LUTs in a processing path can easily fit in the L2 
caches of contemporary GPP cores. With core dedica- 
tion (Section 4.3), the possibility of cache collisions is 
very small. As a result, these LUTs are almost always in 
caches during PHY processing. 

To accelerate PHY processing with data-level parallelism, 
Sora heavily uses the SIMD extensions in modern 
GPPs, such as SSE, 3DNow!, and AltiVec. Although 
these extensions were designed for multimedia 
and graphics applications, they also match the needs of 
wireless signal processing very well because many PHY 
algorithms have fixed computation structures that can 
easily map to large vector operations. In Appendix A, 
we show an example of an optimized digital filter implementation 
using SSE instructions. As our measurements 
later show, such SIMD extensions substantially speed up 
PHY processing in Sora. 


4.2 Multi-core streamline processing 


Even with the above optimizations, a single CPU core 
may not have sufficient capacity to meet the process- 
ing requirements of high-speed wireless communication 
technologies. As a result, Sora must be able to use 
more than one core in a multi-core CPU for PHY pro- 
cessing. This multi-core technique should also be scal- 
able because the signal processing algorithms may be- 
come increasingly more complex as wireless technolo- 
gies progress. 

As discussed in Section 2, PHY processing typically 
contains several functional blocks in a pipeline. These 
blocks differ in processing speed and in input/output 
data rates and units. A block is only ready to execute 
when it has sufficient input data from the previous block. 
Therefore, a key issue is how to schedule a functional 
block on multiple cores when it is ready. 

One possible approach is to run multiple PHY 
pipelines on different cores (Figure 4(a)), and have 
the scheduler dispatch batches of digital samples to a 
pipeline. This approach, however, does not work well 
for SDR because wireless communication has strong dependencies 
in a data stream. For example, in convolutional 
encoding the output of each bit also depends on 
the seven preceding bits in the input stream. Without 
the scheduler knowing all of the data dependencies, it is 
difficult to produce an efficient schedule. 

Figure 4: PHY pipeline scheduling: (a) parallel 
pipelines, (b) dynamic scheduling, (c) static scheduling. 
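
The data dependency in convolutional encoding can be illustrated with a small sketch. It assumes the rate-1/2, constraint-length-7 code with generator polynomials 133 and 171 (octal), as used in 802.11a; in this formulation each output pair depends on the current bit and the bits held in a six-bit shift register. The function names are illustrative and not taken from Sora.

```python
G0, G1 = 0o133, 0o171  # generator polynomials (octal), constraint length 7


def parity(x):
    """Return 1 if x has an odd number of set bits, else 0."""
    return bin(x).count("1") & 1


def conv_encode(bits, state=0):
    """Encode bits; `state` holds the six most recent input bits."""
    out = []
    for b in bits:
        reg = (b << 6) | state  # current bit plus six stored bits
        out.append((parity(reg & G0), parity(reg & G1)))
        state = reg >> 1        # shift: the oldest bit drops out
    return out, state


stream = [1, 0, 1, 1, 0, 0, 1, 0]
whole, _ = conv_encode(stream)

# Splitting the stream across two independent pipelines loses the state.
first, carried = conv_encode(stream[:4])
second_fresh, _ = conv_encode(stream[4:])
second_carried, _ = conv_encode(stream[4:], state=carried)
print(first + second_fresh == whole, first + second_carried == whole)
```

Only the variant that carries the encoder state across the split reproduces the original output, which is why batches cannot simply be dispatched to independent pipelines.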

An alternative scheduling approach is to have only 
one pipeline and dynamically assign ready blocks to 
available cores (Figure 4(b)), in a way similar to thread 
scheduling in a multi-core system. Unfortunately, this 
approach would introduce prohibitively high overhead. 
On the one hand, any two adjacent blocks may be sched- 
uled onto two different cores, thereby requiring synchro- 
nized FIFO (SFIFO) communication between them. On 
the other hand, most PHY processing blocks operate on 
very small data items, e.g., 1-4 bytes each, and the pro- 
cessing only takes a few operations (several to tens of in- 
structions). Such frequent FIFO and synchronization op- 
erations are not justifiable for such small computational 
tasks. 

Instead, Sora chooses a static scheduling scheme. 
This decision is based on the observation that the sched- 
ule of each block in a PHY processing pipeline is ac- 
tually static: the processing pattern of previous blocks 
can determine whether a subsequent block is ready or 
not. Sora can thus partition the whole PHY processing 
pipeline into several sub-pipelines and statically assign 
them to different cores (Figure 4(c)). Within one sub- 
pipeline, when a block has accumulated enough data for 
the next block to be ready, it explicitly schedules the next 
block. Adjacent sub-pipelines from different blocks are 
still connected with an SFIFO, but the number of SFI- 
FOs and their overhead are greatly reduced. 
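
A minimal sketch of static scheduling within one sub-pipeline follows. It is a single-threaded illustration under assumed names (Block, in_size, run), not Sora's scheduler: each block buffers its outputs and directly invokes the next block as soon as enough data has accumulated, so no queue or synchronization sits between blocks on the same core.

```python
class Block:
    """A PHY block that consumes in_size items per invocation (hypothetical)."""

    def __init__(self, fn, in_size, downstream=None):
        self.fn = fn
        self.in_size = in_size        # items needed per invocation
        self.downstream = downstream  # next block in the same sub-pipeline
        self.out_buf = []             # outputs waiting for the next block

    def run(self, items):
        out = self.fn(items)
        nxt = self.downstream
        if nxt is None:
            return
        self.out_buf.append(out)
        # Static rule: once the next block has enough input, call it directly.
        while len(self.out_buf) >= nxt.in_size:
            batch = self.out_buf[:nxt.in_size]
            self.out_buf = self.out_buf[nxt.in_size:]
            nxt.run(batch)


collected = []
sink = Block(fn=lambda items: collected.append(items[0]), in_size=1)
mapper = Block(fn=lambda bits: tuple(bits), in_size=2, downstream=sink)
scrambler = Block(fn=lambda bits: bits[0] ^ 1, in_size=1, downstream=mapper)

for b in [0, 1, 1, 0]:
    scrambler.run([b])
print(collected)
```

Here the toy scrambler emits one bit per call and the toy mapper runs only after two bits are available, mirroring the one-bit and two-bit block sizes mentioned in Section 2.1.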


4.3 Real-time support 


SDR processing is a time-critical task that requires strict 
guarantees of computational resources and hard real-time 
deadlines. As an alternative to relying upon the 
full generality of real-time operating systems, we can 
achieve real-time guarantees by simply dedicating cores 
to SDR processing in a multi-core system. Thus, sufficient 
computational resources can be guaranteed without 
being affected by other concurrent tasks in the system. 

Figure 5: Sora radio control board. 

This approach is particularly plausible for SDR. First, 
wireless communication often requires its PHY to con- 
stantly monitor the channel for incoming signals. There- 
fore, the PHY processing may need to be active all the 
time. It is much better to always schedule this task on 
the same core to minimize overhead like cache misses 
or TLB flushes. Second, previous work on multi-core 
OSes also suggests that isolating applications into dif- 
ferent cores may have better performance compared to 
symmetric scheduling, since an effective use of cache 
resources and a reduction in locks can outweigh dedicat- 
ing cores [10]. Moreover, a core dedication mechanism 
is much easier to implement than a real-time scheduler, 
sometimes even without modifying an OS kernel. For 
example, we can simply raise the priority of a kernel 
thread so that it is pinned on a core and it exclusively 
runs until termination (Section 5.2.3). 


5 Implementation 


We have implemented both the hardware and software 
components of Sora. This section describes our hard- 
ware prototype and software stack, and presents mi- 
crobenchmark evaluations of Sora components. 


5.1 Hardware 


We have designed and implemented the Sora radio con- 
trol board (RCB) as shown in Figure 5. It contains 
a Virtex-5 FPGA, a PCIe-x8 interface, and 256MB of 
DDR2 SDRAM. The RCB can connect to various RF 
front-ends. In our experimental prototype, we use a 
third-party RF front-end, developed by Rice Univer- 
sity [6], that is capable of transmitting and receiving a 
20MHz channel at 2.4GHz or 5GHz. 

Figure 6 illustrates the logical components of the Sora 
hardware platform. The DMA and PCIe controllers interface 
with the host and transfer digital samples between 
the RCB and PC memory. Sora software sends 
commands and reads RCB states through RCB registers. 
The RCB uses its on-board SDRAM as well as 
small FIFOs on the FPGA chip to bridge data streams 
between the CPU and RF front-end. When receiving, 
digital signal samples are buffered in on-chip FIFOs and 
delivered into PC memory when they fit in a DMA burst 
(128 bytes). When transmitting, the large RCB memory 
enables Sora software to first write the generated samples 
onto the RCB, and then trigger transmission with 
another command to the RCB. This functionality provides 
flexibility to the Sora software for pre-calculating 
and storing several waveforms before actually transmitting 
them, while allowing precise control of the timing 
of the waveform transmission. 

Figure 6: Hardware architecture of RCB and RF. 

While implementing Sora, we encountered a consis- 
tency issue in the interaction between DMA operations 
and the CPU cache system. When a DMA operation 
modifies a memory location that has been cached in the 
L2 cache, it does not invalidate the corresponding cache 
entry. When the CPU reads that location, it can there- 
fore read an incorrect value from the cache. One naive 
solution is to disable cached accesses to memory regions 
used for DMA, but doing so will cause a significant 
degradation in memory access throughput. 

We solve this problem with a smart-fetch strat- 
egy, enabling Sora to maintain cache coherency with 
DMA memory without drastically sacrificing through- 
put. First, Sora organizes DMA memory into small slots, 
whose size is a multiple of a cache line. Each slot begins 
with a descriptor that contains a flag. The RCB sets the 
flag after it writes a full slot of data, and the flag is cleared after 
the CPU processes all data in the slot. When the CPU 
moves to a new slot, it first reads its descriptor, causing 
a whole cache line to be filled. If the flag is set, the data 
just fetched is valid and the CPU can continue process- 
ing the data. Otherwise, the RCB has not updated this 
slot with new data. Then, the CPU explicitly flushes the 
cache line and repeats reading the same location. This 
next read refills the cache line, loading the most recent 
data from memory. 
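
The smart-fetch loop can be modelled in a few lines. The sketch below is a toy simulation under stated assumptions, not driver code: memory and the CPU cache are plain Python containers, each slot is treated as a single cache line holding the descriptor flag and the data, and the names (Model, dma_write, cpu_read, flush, smart_fetch) are hypothetical.

```python
class Model:
    """Toy model of DMA memory plus a CPU cache that DMA does not invalidate."""

    def __init__(self, nslots):
        self.mem = [{"flag": 0, "data": None} for _ in range(nslots)]
        self.cache = {}

    def dma_write(self, slot, data):
        # The device writes memory directly; the cached copy is left stale.
        self.mem[slot] = {"flag": 1, "data": data}

    def cpu_read(self, slot):
        # A miss fills the cache line; a hit returns the cached copy.
        if slot not in self.cache:
            self.cache[slot] = dict(self.mem[slot])
        return self.cache[slot]

    def flush(self, slot):
        # Explicitly drop the cache line so the next read goes to memory.
        self.cache.pop(slot, None)


def smart_fetch(m, slot, max_tries=1000):
    """Read the descriptor; if the flag is clear, flush and read again."""
    for _ in range(max_tries):
        line = m.cpu_read(slot)
        if line["flag"]:
            return line["data"]
        m.flush(slot)
    return None


m = Model(nslots=2)
m.cpu_read(0)                   # CPU caches the slot while it is still empty
m.dma_write(0, "samples")       # DMA updates memory behind the cache
stale = m.cpu_read(0)["flag"]   # a plain cached read still sees flag == 0
print(stale, smart_fetch(m, 0))
```

The flush is paid only when the CPU catches up with the device; slots that are already full are read through the cache at full speed.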


5.2 Software 


The Sora software is written in C, with some assembly 
for performance-critical processing. The entire Sora 
software stack is implemented on Windows XP as a network 
device driver and it exposes a virtual Ethernet interface 
to the upper TCP/IP stack. Since any software 
radio implemented on Sora can appear as a normal network 
device, all existing network applications can run 
unmodified on it. 

The Sora software currently consists of 23,325 non- 
blank lines of C code. Of this total, 14,529 lines are for 
system support, including driver framework, memory 
management, streamline processing, etc. The remaining 
8,796 lines comprise the PHY processing library. 


5.2.1 PHY processing library 


In the Sora PHY processing library, we extensively exploit 
look-up tables (LUTs) and SIMD instructions 
to optimize the performance of PHY algorithms. 
We have been able to rewrite more than half 
of the PHY algorithms with LUTs. Some LUTs are 
straightforward pre-calculations; others require more sophisticated 
implementations to keep the LUT size small. 
For the soft-demapper example mentioned earlier, we 
can greatly reduce the LUT size (e.g., 1.5KB for the 
802.11a/g 54Mbps modulation) by exploiting the symmetry 
of the algorithm. In our SoftWiFi implementation 
described below, the overall size of the LUTs used 
is around 200KB in 802.11a/g and 310KB in 802.11b, 
both of which fit comfortably within the L2 caches of 
commodity CPUs. 

We also heavily use SIMD instructions in coding Sora 
software. We currently use the SSE2 instruction set de- 
signed for Intel CPUs. Since the SSE registers are 128- 
bit wide while most PHY algorithms require only 8-bit 
or 16-bit fixed-point operations, one SSE instruction can 
perform 8 or 16 simultaneous calculations. SSE2 also 
has rich instruction support for flexible data permuta- 
tions, and most PHY algorithms, e.g., FFT, FIR Filter 
and Viterbi, can fit naturally into this SIMD model. For 
example, the Sora Viterbi decoder uses only 40 cycles to 
compute the branch metric and select the shortest path 
for each input. As a result, our Viterbi implementation 
can handle 802.11a/g at the 54Mbps modulation with 
only one 2.66GHz CPU core, whereas previous imple- 
mentations relied on hardware implementations. Note 
that other GPP architectures, like AMD and PowerPC, 
have very similar SIMD models and instruction sets; 
AMD’s Enhanced 3DNow!, for instance, includes SSE 
instructions plus a set of DSP extensions. We expect 
that our optimization techniques will directly apply to 
these other GPP architectures as well. In Appendix A, 
we show a simple example of a functional block using 
SIMD instruction optimizations. 

Table 1 summarizes some key PHY processing algorithms 
we have implemented in Sora, together with the 
optimization techniques we have applied. The table also 
compares the performance of a conventional software 
implementation (e.g., a direct translation from a hardware 
implementation) and the Sora implementation with 
the LUT and SIMD optimizations. 

                                                       Optimization  Computation Required (Mcycles/sec) 
Algorithm              Configuration                   Method        Conventional   Sora       Speedup 
FIR Filter             16-bit I/Q, 37 taps, 22MSps     SIMD          5,780.34       616.41 
                       16-bit I/Q, 4x Oversample       SIMD          422.45         198.72 

IEEE 802.11a 
FFT/IFFT                                               SIMD          754.11         459.52 
Viterbi                                                SIMD+LUT      68,553.57      1,408.93   48.7x 
Soft demapper          24Mbps, QAM 16                                115.05         46.55 
Soft demapper          54Mbps, QAM 64                                255.86 
Scramble & Descramble  54Mbps                                        547.86         40.29 

Table 1: Key algorithms in IEEE 802.11b/a and their performance with conventional and Sora implementations. 


5.2.2 Lightweight, synchronized FIFOs 


Sora allows different PHY processing blocks to stream- 
line across multiple cores while communicating with 
one another through shared memory FIFO queues. If 
two blocks are running on different cores, their access 
to the shared FIFO must be synchronized. The tradi- 
tional implementation of a synchronized FIFO uses a 
counter to synchronize the writer and reader, which we 
refer to as a counter-based FIFO (CBFIFO) and illustrate 
in Figure 7(a). However, this counter is shared by two 
processor cores, and every write to the variable by one 
core will cause a cache miss on the other core. Since 
both the producer and consumer modify this variable, 
two cache misses are unavoidable for each datum. It is 
also quite common to have very fine data granularity in 
PHY (e.g., 4-16 bytes as summarized in Table 1). There- 
fore, such cache misses will result in significant over- 
head when synchronization has to be performed very 
frequently (e.g., once per micro-second) for such small 
pieces of data. 

In Sora, we implement another synchronized FIFO
that removes the sole shared synchronization variable.
The idea is to augment each data slot in the FIFO with
a header that indicates whether the slot is empty or not.
We pad each data slot to be a multiple of a cache line.
Thus, the consumer is always chasing the producer in
the circular buffer for filled slots, as outlined in Figure
7(b). This chasing-pointer FIFO (CPFIFO) largely
mitigates the overhead even for very fine-grained
synchronization. If the speed of the producer and consumer is

// producer:
void write_fifo ( DATA_TYPE data ) {
    while (cnt >= q_size); // spin wait
    q[w_tail] = data;
    w_tail = (w_tail+1) % q_size;
    InterlockedIncrement (&cnt); // increase cnt by 1
}

// consumer:
void read_fifo ( DATA_TYPE * pdata ) {
    while (cnt==0); // spin wait
    *pdata = q[r_head];
    r_head = (r_head+1) % q_size;
    InterlockedDecrement (&cnt); // decrease cnt by 1
}

(a)


// producer:
void write_fifo ( DATA_TYPE data ) {
    while (q[w_tail].flag > 0); // spin wait
    q[w_tail].data = data;
    q[w_tail].flag = 1; // occupied
    w_tail = (w_tail+1) % q_size;
}

// consumer:
void read_fifo ( DATA_TYPE * pdata ) {
    while (q[r_head].flag == 0); // spin
    *pdata = q[r_head].data;
    q[r_head].flag = 0; // release
    r_head = (r_head + 1) % q_size;
}

(b)


Figure 7: Pseudo-code for synchronized (a) CBFIFOs 
and (b) CPFIFOs. 


the same and the two pointers are separated by a partic- 
ular offset (e.g., two cache lines in the Intel architecture), 
no cache miss will occur during synchronized streaming 
since the local cache will prefetch the following slots be- 
fore the actual access. If the producer and the consumer 
have different processing speeds, e.g., the reader is faster 
than the writer, then eventually the consumer will wait 
for the producer to release a slot. In this case, each time 
the producer writes to a slot, the write will cause a cache 
miss at the consumer. But the producer will not suffer

Table 2: DMA throughput performance of the RCB.

Method            Memory Throughput
Cache Disabled    707.2Mbps
Smart-fetch       10.1Gbps

Table 3: Memory throughput.


a miss since the next free slot will be prefetched into its 
local cache. Fortunately, such cache misses experienced 
by the consumer will not cause significant impact on the 
overall performance of the streamline processing since 
the consumer is not the bottleneck element. 
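
The chasing-pointer idea can be sketched in portable C11 as follows. This is an illustrative single-producer, single-consumer sketch, not Sora's implementation: the names (`cpfifo_t`, `cpfifo_try_write`, `cpfifo_try_read`), the 64-byte cache line, the 8-slot queue, and the non-blocking "try" interface (used instead of spin-waiting so the example can be exercised from one thread) are all assumptions.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define CACHE_LINE 64 /* assumed cache-line size in bytes */
#define Q_SIZE 8      /* assumed number of slots */

/* Each slot carries its own full/empty flag and is padded to one cache
 * line, so producer and consumer share no single synchronization variable. */
typedef struct {
    atomic_int flag; /* 0 = empty, 1 = occupied */
    int32_t data;
    char pad[CACHE_LINE - sizeof(atomic_int) - sizeof(int32_t)];
} slot_t;

typedef struct {
    slot_t q[Q_SIZE];
    unsigned w_tail; /* touched only by the producer */
    unsigned r_head; /* touched only by the consumer */
} cpfifo_t;

static void cpfifo_init(cpfifo_t *f)
{
    for (int i = 0; i < Q_SIZE; i++)
        atomic_init(&f->q[i].flag, 0);
    f->w_tail = 0;
    f->r_head = 0;
}

/* Producer side: returns false if the next slot is still occupied. */
static bool cpfifo_try_write(cpfifo_t *f, int32_t v)
{
    slot_t *s = &f->q[f->w_tail];
    if (atomic_load_explicit(&s->flag, memory_order_acquire) != 0)
        return false;
    s->data = v;
    atomic_store_explicit(&s->flag, 1, memory_order_release); /* publish */
    f->w_tail = (f->w_tail + 1) % Q_SIZE;
    return true;
}

/* Consumer side: returns false if the next slot is still empty. */
static bool cpfifo_try_read(cpfifo_t *f, int32_t *out)
{
    slot_t *s = &f->q[f->r_head];
    if (atomic_load_explicit(&s->flag, memory_order_acquire) == 0)
        return false;
    *out = s->data;
    atomic_store_explicit(&s->flag, 0, memory_order_release); /* release slot */
    f->r_head = (f->r_head + 1) % Q_SIZE;
    return true;
}
```

A blocking variant would simply spin on the `try` functions. In a real deployment the two indices would also be placed on separate cache lines, since each is private to one core.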


5.2.3 Real-time support


Sora uses exclusive threads (or ethreads) to dedicate
cores for real-time SDR tasks. Sora implements ethreads
without any modification to the kernel code. An ethread
is implemented as a kernel-mode thread, and it exploits
the processor affinity that is commonly supported in
commodity OSes to control on which core it runs. Once
the OS has scheduled the ethread on a specified physical
core, the ethread raises its IRQL (interrupt request
level) to a level as high as that of the kernel scheduler,
e.g., DISPATCH_LEVEL in Windows. Thus, the ethread takes
control of the core and prevents itself from being
preempted by other threads.

Running at such an IRQL, however, does not prevent
the core from responding to hardware interrupts.
Therefore, we also constrain the interrupt affinities of
all devices attached to the host. If an ethread is running
on one core, all interrupt handlers for installed devices
are removed from the core, thus preventing the core from
being interrupted by hardware. To ensure the correct
operation of the system, Sora always ensures core zero is
able to respond to all hardware interrupts. Consequently,
Sora only allows ethreads to run on cores whose ID is
greater than zero.


5.3 Evaluation


We measure the performance of the Sora implementa- 
tion with microbenchmark experiments. We perform all 
measurements on a Dell XPS PC with an Intel Core 2 
Quad 2.66GHz CPU (Section 7.1 details the complete 
hardware configuration). 


Throughput and latency. To measure PCIe throughput,
we instruct the RCB to read/write a number of
descriptors from/to main memory via DMA, and measure
the time taken. Table 2 summarizes the results, which
agree with the hardware specifications.

To precisely measure PCIe latency, we instruct the
RCB to read a memory address in host memory. We
measure the time interval between issuing the request
and receiving the response data in hardware. Since the
memory read operation accesses the PCIe bus using a
round trip operation, we use half of the measured time
to estimate the one-way delay. This one-way delay is
360ns with a worst case variation of 4ns. We also
confirm that the RCB hardware itself induces negligible
delay except for buffers on the data path. However, such
delay is tiny when the buffer is small. For example, the
DMA burst size is 128 bytes, which causes only 76ns
latency in PCIe-x8.

Figure 8: Overhead of synchronized FIFOs.


Table 3 compares measured memory throughput in 
two different cases. The first row shows the read 
throughput of uncacheable memory. It is only 707Mbps, 
which is insufficient for 802.11 processing. The second 
row shows the performance of the smart-fetch technique. 
With smart-fetch, the memory throughput is a factor of 
14 greater compared to the uncacheable case, and suffi- 
cient for supporting high-speed protocol processing. We 
note, however, that it is still slower than reading from 
normal cacheable memory without having to be consis- 
tent with DMA operations. This reduction is due to the 
overhead of additional cache-line invalidations. 


Synchronized FIFO. To measure the overhead of the 
synchronized CBFIFO and CPFIFO implementations, 
we process ten thousand data inputs through the FIFOs 
first on one core, and then on two cores. We also vary 
the number of cycles to process each datum to change 
the ratio of synchronization time with processing time. 
When processing with two cores, we allocate the same
computation to each core. Denote $t_1$ and $t_2$ as the
completion times of processing on one core and two cores,
respectively. We then define the overhead of a
synchronized FIFO as $\frac{t_2 - t_1/2}{t_1/2}$.


Figure 8 shows the results of this experiment. The
x-axis shows the total processing cycles required for each
datum, and the y-axis shows the overhead of the
synchronized FIFO. We make the following observations
from these results. First, partitioning work across cores
gives different overheads depending upon whether the cores
are on the same die. Two cores on the same die share the
same L2 cache, while cores on different dies are
connected via a shared front-side bus. Thus, streaming
data between functional blocks across cores on the same
die has significantly less overhead than streaming between
cores on different dies.

Second, the overhead decreases as the computation
time per datum increases, as expected. When the
computation per datum is very short, the communication
overhead between cores dominates. The Intel CPU requires
about 10 cycles to access its local L2 cache, and 100
cycles to access a remote cache. Therefore, when there are
40 cycles per datum (20 cycles on each of the two cores),
the overhead is at least $\frac{10}{20} = 50\%$ when two
cores are on one die, and $\frac{100}{20} = 500\%$ when
two cores are on different dies. The CPFIFO almost
achieves this lower bound. When there is more
computation required per datum, however, the data transfer
can be overlapped with computation, enabling the
overhead to be hidden. Finally, the CBFIFO generally has
significantly higher overhead compared to the CPFIFO
due to the additional synchronization overhead on the
shared variable, which the CPFIFO avoids.
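
The lower-bound arithmetic above can be checked with a short sketch. It assumes that overhead is measured relative to the ideal two-core completion time (half the one-core time) and that each datum costs one cache access to hand over; the function names are hypothetical.

```c
/* Overhead of a two-core run relative to the ideal time t1/2
 * (assumed definition: extra time divided by ideal time). */
static double fifo_overhead(double t1, double t2)
{
    double ideal = t1 / 2.0;
    return (t2 - ideal) / ideal;
}

/* Lower bound when every datum pays one cache access to cross cores:
 * the work is split evenly, so each core spends cycles_per_datum / 2. */
static double overhead_lower_bound(double cycles_per_datum, double access_cycles)
{
    return access_cycles / (cycles_per_datum / 2.0);
}
```

For 40 cycles per datum this gives 0.5 (50%) with a 10-cycle shared-L2 access and 5.0 (500%) with a 100-cycle remote access, matching the figures quoted in the text.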


6 Case study: SoftWiFi 


To demonstrate the use of Sora, we have developed a
fully functional WiFi transceiver on the Sora platform
called SoftWiFi. Our SoftWiFi stack supports all IEEE
802.11a/b/g modulations and can communicate
seamlessly with commercial WiFi network cards.

Figure 9 illustrates the Sora SoftWiFi implementation.
The MAC state machine (SM) is implemented
as an ethread. Since 802.11 is a half-duplex radio,
the demodulation components can run directly within
a MAC SM thread. If a single core is insufficient for
all PHY processing (e.g., 802.11a/g), the PHY processing
can be partitioned across two ethreads. These two
ethreads are streamlined using a CPFIFO. An additional
thread, Snd_thread, modulates the outgoing frames into
waveform samples in the background. These modulated
waveforms can be pre-stored in the RCB's memory to
facilitate transmission. The Completion_thread monitors
the Rcv_buf and notifies upper software layers of
any correctly received frames. This thread also cleans
up the snd and rcv buffers after they are used.

SoftWiFi implements the basic access mode of
802.11. The detailed MAC SM is shown in Figure 10.
Normally, the SM is in the Frame Detection (FD) state.
In that state, the RCB constantly writes samples into
the Rx_buf. The SM continuously measures the average
energy to determine whether the channel is clear or
whether there is an incoming frame.

Figure 9: SoftWiFi implementation.

Figure 10: State machine of the SoftWiFi MAC.


The transmission of a frame follows the CSMA mech- 
anism. When there is a pending frame, the SM first 
needs to check if the energy on the channel is low. If 
the channel is busy, the transmission should be deferred 
and a backoff timer started. Each time the channel be- 
comes free, the SM checks if any backoff time remains. 
If the timer goes to zero, it transmits the frame. 
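
The carrier-sense and deferral logic described above can be sketched as follows. This is a simplified illustration rather than the SoftWiFi state machine: the names (`sample_t`, `channel_busy`, `csma_t`, `csma_step`), the energy threshold, and the slot-based backoff counter are assumptions, and 802.11 details such as inter-frame spaces and contention-window growth are omitted.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct { int16_t i, q; } sample_t; /* one complex baseband sample */

/* Carrier sense: average energy (I^2 + Q^2) over a window vs. a threshold. */
static bool channel_busy(const sample_t *s, size_t n, uint32_t threshold)
{
    uint64_t sum = 0;
    for (size_t k = 0; k < n; k++)
        sum += (uint64_t)((int32_t)s[k].i * s[k].i)
             + (uint64_t)((int32_t)s[k].q * s[k].q);
    return n > 0 && sum / n > threshold;
}

typedef struct {
    bool deferring; /* a transmission has been deferred */
    int backoff;    /* remaining idle slots before transmitting */
} csma_t;

/* Called once per slot while a frame is pending.
 * Returns true when the frame should be transmitted in this slot. */
static bool csma_step(csma_t *c, bool busy, int backoff_slots)
{
    if (busy) {
        if (!c->deferring) { /* first busy observation: start backoff */
            c->deferring = true;
            c->backoff = backoff_slots;
        }
        return false;
    }
    if (c->backoff > 0) { /* channel free: count down remaining backoff */
        c->backoff--;
        return false;
    }
    c->deferring = false; /* timer at zero: transmit */
    return true;
}
```

In this sketch the backoff counter is frozen while the channel is busy and only decrements on idle slots, which mirrors the "each time the channel becomes free" behavior in the text.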

SoftWiFi starts to receive a frame if it detects a high 
energy in the FD state. In 802.11, it takes three steps in 
the PHY layer to receive a frame. First, the PHY layer 
needs to synchronize to the frame, i.e., find the start- 
ing point of the frame (timing synchronization) and the 
frequency offset and phase of the sample stream (car- 
rier synchronization). Synchronization is usually done 
by correlating the incoming samples with a pre-defined 
preamble. Subsequently, the PHY layer needs to demod- 
ulate the PLCP (Physical Layer Convergence Protocol) 
header, which is always transmitted using a fixed low- 
rate modulation mode. The PLCP header contains the 
length of the frame as well as the modulation mode, pos- 
sibly a higher rate, of the frame data that follows. Thus, 
only after successful reception of the PLCP header will 
the PHY layer know how to demodulate the remainder 
of the frame. 
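
Timing synchronization by preamble correlation can be illustrated with a small sketch. It assumes real-valued samples for brevity (actual 802.11 processing correlates complex I/Q samples and also estimates frequency offset); the function name `find_preamble` is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Slides the known preamble p[0..m-1] over x[0..n-1] and returns the
 * offset with the largest correlation score, i.e. the estimated frame start. */
static size_t find_preamble(const int16_t *x, size_t n,
                            const int16_t *p, size_t m)
{
    size_t best = 0;
    int64_t best_score = INT64_MIN;
    for (size_t off = 0; off + m <= n; off++) {
        int64_t score = 0;
        for (size_t k = 0; k < m; k++)
            score += (int64_t)x[off + k] * p[k];
        if (score > best_score) {
            best_score = score;
            best = off;
        }
    }
    return best;
}
```

A practical receiver would additionally compare the peak against a threshold before declaring a frame, rather than always accepting the maximum.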

After successfully receiving a frame, the 802.11 MAC
standard requires a station to transmit an ACK frame in
a timely manner. For example, 802.11b requires that an
ACK frame be sent with a 10µs delay. However, this
ACK requirement is quite difficult for an SDR
implementation to achieve in software on a PC. Both
generating and transferring the waveform across the PC bus
will cause a latency of several microseconds, and the sum
is usually larger than mandated by the standard.
Fortunately, an ACK frame generally has a fixed pattern.
For example, in 802.11 all data in an ACK frame is fixed
except for the sender address of the corresponding data
frame. Thus, in SoftWiFi, we can precalculate most of
an ACK frame (19 bytes), and update only the address
(10 bytes). Further, we can do it early in the processing,
immediately after demodulating the MAC header,
and without waiting for the end of a frame. We then
pre-store the waveform into the memory of the RCB. Thus,
the time for ACK generation and transferring can overlap
with the demodulation of the data frame. After the
MAC SM demodulates the entire frame and validates the
CRC32 checksum, it instructs the RCB to transmit the
ACK, which has already been stored on the RCB. Thus,
the latency for ACK transmission is very small.
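
The "fixed template plus address" structure of an ACK can be sketched at the MAC-byte level. This is an illustration under assumptions, not SoftWiFi's code: it covers only the 14-byte MAC frame (frame control, duration, receiver address, FCS) and not the PHY header or the modulated waveform that SoftWiFi actually pre-stores, so its byte counts differ from those quoted above. The names `crc32_ieee` and `ack_fill` are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Bitwise CRC-32 (reflected polynomial 0xEDB88320), as used for the FCS. */
static uint32_t crc32_ieee(const uint8_t *p, size_t n)
{
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < n; i++) {
        crc ^= p[i];
        for (int b = 0; b < 8; b++)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return ~crc;
}

#define ACK_LEN 14 /* 2 frame control + 2 duration + 6 address + 4 FCS */

/* Fills an ACK: the first four bytes never change; only the receiver
 * address (taken from the data frame's sender) and the FCS are updated. */
static void ack_fill(uint8_t frame[ACK_LEN], const uint8_t ra[6])
{
    static const uint8_t fixed[4] = {0xD4, 0x00, 0x00, 0x00};
    memcpy(frame, fixed, 4);
    memcpy(frame + 4, ra, 6);
    uint32_t fcs = crc32_ieee(frame, 10);
    for (int i = 0; i < 4; i++) /* FCS is appended least significant byte first */
        frame[10 + i] = (uint8_t)(fcs >> (8 * i));
}
```

Because only the address-dependent bytes change, the update can start as soon as the MAC header of the incoming frame has been demodulated, which is the overlap the text describes.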

In rare cases when the incoming data frame is quite
small (e.g., the frame contains only a MAC header and
zero payload), SoftWiFi cannot fully overlap ACK
generation and the DMA transfer with demodulation to
completely hide the latency. In this case, SoftWiFi may
fail to send the ACK in time. We address this problem 
in SoftWiFi by maintaining a cache of previous ACKs 
in the RCB. With 802.11, all data frames from one node 
will have exactly the same ACK frame. Thus, we can 
use pre-allocated memory slots in the RCB to store ACK 
waveforms for different senders (we currently allocate 
64 slots). Now, when demodulating a frame, if the ACK 
frame is already in the RCB cache, the MAC SM sim- 
ply instructs the RCB to transmit the pre-cached ACK. 
With this scheme, SoftWiFi may be late on the first small 
frame from a sender, effectively dropping the packet 
from the sender’s perspective. But retransmissions, and 
all subsequent transmissions, will find the appropriate 
ACK waveform already stored in the RCB cache. 
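
A per-sender ACK cache of this kind might be organized as in the sketch below. Only the 64-slot capacity comes from the text; the structure names, the linear lookup, and the round-robin replacement policy are assumptions made for illustration, and the cached waveform itself is omitted.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define ACK_SLOTS 64 /* slot count quoted in the text */

typedef struct {
    bool valid;
    uint8_t sender[6]; /* MAC address the cached ACK is addressed to */
} ack_slot_t;

typedef struct {
    ack_slot_t slot[ACK_SLOTS];
    unsigned next; /* next slot to overwrite (round-robin, assumed policy) */
} ack_cache_t;

/* Returns the slot index holding this sender's ACK, or -1 on a miss. */
static int ack_cache_find(const ack_cache_t *c, const uint8_t sender[6])
{
    for (int i = 0; i < ACK_SLOTS; i++)
        if (c->slot[i].valid && memcmp(c->slot[i].sender, sender, 6) == 0)
            return i;
    return -1;
}

/* Records a sender after its ACK waveform has been stored; returns the slot. */
static int ack_cache_insert(ack_cache_t *c, const uint8_t sender[6])
{
    int i = (int)c->next;
    c->slot[i].valid = true;
    memcpy(c->slot[i].sender, sender, 6);
    c->next = (c->next + 1) % ACK_SLOTS;
    return i;
}
```

On a hit the MAC SM would tell the RCB which slot to transmit; on a miss it would generate the ACK as usual and insert it for subsequent frames.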

We have implemented and tested the full 802.11a/g/b
SoftWiFi transceivers, which support DSSS (Direct
Sequence Spreading: 1 and 2Mbps in 11b), CCK
(Complementary Code Keying: 5.5 and 11Mbps in 11b), and
OFDM (Orthogonal Frequency Division Multiplexing:
6, 9 and up to 54Mbps in 11a/g). It took one student
about one month to develop and test 11b on Sora, and
another student one and a half months to code and test
11a/g; these efforts also include the time for
implementing the corresponding algorithms in the PHY
library.


7 Evaluations 


In this section we evaluate the end-to-end application
performance delivered by Sora. Our goals are to

ing with a commercial WiFi card. Sora–Commercial
presents the transmission throughput when a Sora node
sends data. Commercial–Sora presents the throughput
when a Sora node receives data. Commercial–Commercial
presents the throughput when a commercial
NIC communicates with another commercial NIC.


show that Sora interoperates seamlessly with commercial
802.11 devices, and that the Sora SoftWiFi
implementation achieves equivalent performance. As a
result, we show that Sora can process signals sufficiently
fast to achieve full channel utilization, and that it can
satisfy all timing requirements of the 802.11 standards
with a software implementation on a GPP. We also
characterize the CPU utilization of the software
processing. In the following, we sometimes use the label
11a/g to present data for both 11a and 11g, since 11a and
11g have exactly the same OFDM PHY specification.


7.1 Experimental setup 


The experimental setup consists of two high-end Dell
XPS PCs (Intel Core 2 Quad 2.66GHz CPU, 4GB DDR2
400MHz SDRAM, and two PCIe-16x slots) and two
laptops, all running Windows XP. Each Dell PC is
equipped with a Sora radio control board (RCB) and an
802.11 RF board (Section 5), and runs Sora and the
SoftWiFi implementation. Each CPU core has 32KB
instruction and 32KB data L1 caches and a 2MB L2 cache.
The Dell laptops use commercial WiFi NICs. We have used
several different WiFi NICs in our experiments, including
Netgear, Cisco and Intel devices. All give similar
results. Thus, we present results just for the Netgear
WAG511 device (based on the Atheros AR5212 chipset).


7.2 Throughput 


Figure 11 shows the transmitting and receiving
throughput of a Sora SoftWiFi node when it communicates
with a commercial WiFi NIC. In the "Sora–Commercial"
configuration, the Sora node acts as a sender, generates
1400-byte UDP frames, and transmits them by unicast
to a laptop equipped with a commercial NIC. In the
"Commercial–Sora" configuration, the Sora node acts
as a receiver, and the laptop generates the same
workload. The "Commercial–Commercial" configuration
shows the throughput when both sender and receiver are
commercial NICs. In all configurations, the hosts were
at the same distance from each other and experienced
very little packet loss. Figure 11 shows the throughput
achieved for all configurations with the various
modulation modes in 11a/b/g. We show only three selected
rates in 11a/g for conciseness. The results are averaged
over five runs (the variance was very small).

We make a number of observations from these results.
First, the Sora SoftWiFi implementation operates
seamlessly with commercial devices, showing that Sora
SoftWiFi is protocol compatible. Second, Sora SoftWiFi
achieves performance similar to that of commercial
devices. The throughputs for both configurations are
essentially equivalent, demonstrating that SoftWiFi
(1) has the processing capability to demodulate all
incoming frames at full modulation rates, and (2) can
meet the 802.11 timing constraints for returning ACKs
within the delay window required by the standard. We note
that the maximal achievable application throughput for
802.11 is less than 80% of the PHY data rate, and the
percentage decreases as the PHY data rate increases. This
limit is due to the overhead of headers at different
layers as well as the MAC overhead to coordinate channel
access (i.e., carrier sense, ACKs, and backoff), and is a
well-known property of 802.11 performance.


7.3 CPU Utilization 


What is the processing cost of onloading all digital sig- 
nal processing into software on the host? Figure 12 
shows the CPU utilization of a Sora SoftWiFi node to 
support modulation/demodulation at the corresponding 
rate. We normalize the utilization to the processing ca- 
pability of one core. For receiving, higher modulation 
rates require higher CPU utilization due to the increased 
computational complexity of demodulating the higher 
rates. We can see that one core of a contemporary multi- 
core CPU can comfortably support all 11b modulation 
modes. With the 11Mbps rate, Sora SoftWiFi requires 
roughly 70% of the computational power of one core 
for real-time SDR processing. However, 802.11a/g PHY
processing is more complex than 11b and may require
two cores for receive processing. In our software
implementation, the Viterbi decoder in 11a/g is the most
computationally intensive component. It alone requires
more than 1.4 Gcycles/s at modulation rates higher than
24Mbps (Table 1). Therefore, it is natural to partition
the receive pipeline across two cores, with the Viterbi 
decoder on one core and the remainder on another. With 
the parallelism enabled by this streamline processing,
we reduce the delay to process one 11a/g symbol from
4.8µs to 3.9µs, meeting the requirement of the standard
(i.e., 4µs) for 54Mbps. Note that the CPU utilization is
not completely linear with the modulation rates in 11b
because the 5.5/11Mbps rates use a different modulation
scheme than the 1/2Mbps rates.

Figure 12: CPU Utilization of Sora.

The CPU utilization for transmission, however, is
generally lower than for receiving. Note that the
utilization is constant for all 11b rates. Since the
transmission part of 11b can be optimized effectively with
LUTs, for different rates we just use different LUTs. In
11a/g, since all samples need to pass through an IFFT, the
computation requirements increase as the rate increases.


7.4 Detailed processing costs 


The results in Figure 12 presented the overall CPU
utilization for a Sora SoftWiFi receiving node. As
discussed in Section 6, a complete receiver has a number
of stages: frame detection, frame synchronization, and
demodulators for both the PLCP header and its data,
depending on the modulation mode. How does CPU
utilization partition across these stages? Figure 13 shows
the computational cost for each component for receiving
a 1400-byte UDP packet in each modulation mode;
again, we show only three representative modulation
rates for 11a/g. Frame detection (FD) has the lowest
utilization (11% of a 2.66GHz core for 11b and only 3.2%
for 11a/g) and is constant across all modulation modes
in each standard. Note that frame detection needs to
execute even if there is no communication since a frame
may arrive at any time. When Sora detects a frame,
it uses 29% of a core to synchronize to the start of a
frame (SYNC) for 11b, and it uses 20% of a core to
synchronize to an 11a/g frame. Then Sora can demodulate
the PLCP header, which is always transmitted using the
lowest modulation rate. It requires slightly less (27.5%)
computation overhead than synchronization for 11b; but
it needs much more computation (44%) for 11a.
Demodulation of the data (DATA) at the higher rates is the
most computationally expensive step in a receiver. It
requires 75% of a core at 11Mbps for 11b, and the
utilization exceeds one core (134%) for processing at
54Mbps in 11a/g. This result indicates that we need to
streamline the processing to at least two cores to support
this modulation.

Figure 13: Detailed processing costs in WiFi PHY.


8 Extensions


The flexibility of Sora allows us to develop interesting 
extensions to current WiFi protocol. 


8.1 Jumbo Frames


If the channel conditions are good, transmitting data
with larger frames can reduce the overhead of MAC/PHY
headers, preambles and the per-frame ACK. However,
the maximal frame size of 802.11 is fixed at 2304
bytes. With simple modifications (changes in a few
lines), SoftWiFi can transmit and receive jumbo frames
of up to 32KB. Figure 14 shows the throughput of
sending UDP packets between two Sora SoftWiFi nodes
using the jumbo frame optimization across a range of
frame sizes (with 11b using the 11Mbps modulation
mode). When we increase the frame size from 1KB
to 6KB, the end-to-end throughput increases by 39%, from
5.9Mbps to 8.2Mbps. When we further increase the
frame size to 7KB, however, the throughput drops because
the frame error rate also increases with the size.
So, at some point, the increasing error rate will offset
the gain of reducing the overhead. Note that our default
commercial NIC rejects frames larger than 2304 bytes,
even if those frames can be successfully demodulated.
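
The trade-off between amortized overhead and frame error rate can be illustrated with a toy model. Everything in this sketch is an assumption made for illustration (a fixed per-frame overhead in microseconds, independent bit errors with a constant bit error rate, and the function names); it is not the model or the measurements behind Figure 14.

```c
/* base^e by repeated squaring (avoids a dependency on libm). */
static double pow_uint(double base, unsigned long e)
{
    double r = 1.0;
    while (e) {
        if (e & 1ul) r *= base;
        base *= base;
        e >>= 1;
    }
    return r;
}

/* Expected goodput in Mbps for a given frame size.
 * airtime = payload bits / rate + fixed per-frame overhead (us);
 * a frame succeeds only if every bit is received correctly. */
static double goodput_mbps(unsigned long frame_bytes, double rate_mbps,
                           double overhead_us, double ber)
{
    unsigned long nbits = 8ul * frame_bytes;
    double airtime_us = (double)nbits / rate_mbps + overhead_us;
    double p_success = pow_uint(1.0 - ber, nbits);
    return p_success * (double)nbits / airtime_us;
}
```

With no bit errors, larger frames always help in this model because the fixed overhead is spread over more payload. With a non-zero bit error rate, the success probability decays exponentially in the frame size, so goodput peaks at an intermediate size and then falls, which is the qualitative behavior described above.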


In this experiment, we place the antennas close to each
other, clearly a best-case scenario. Our goal, though,
is not to argue that jumbo frames for 802.11 are
necessarily a compelling optimization. Rather, we want
to demonstrate that the full programmability offered by
Sora makes it both possible and straightforward to
explore such "what if" questions on a GPP SDR platform.

Figure 14: Throughput with Jumbo Frames between two
Sora SoftWiFi nodes.
50ms 


100ms 


Table 4: Timing error of Sora in TDMA. 





8.2 TDMA MAC


To evaluate the ability of Sora to precisely control the 
transmission time of a frame, we implemented a simple 
TDMA MAC that schedules a frame transmission at a 
predefined time interval. The MAC state machine (SM) 
runs in an ethread, and it continuously queries a timer 
to check if the pre-defined amount of time has elapsed. 
If so, the MAC SM will instruct the RCB to send out a 
frame. The modification is simple and straightforward 
with about 20 lines of additional code. 
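
The polling logic of such a TDMA MAC can be sketched as below. The names (`tdma_t`, `tdma_poll`) and the microsecond timestamp are assumptions; in the real system the loop would run inside an ethread and the "transmit" action would be an instruction to the RCB.

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint64_t next_tx_us;  /* absolute time of the next scheduled transmission */
    uint64_t interval_us; /* predefined scheduling interval */
} tdma_t;

/* Called continuously with the current time. Returns true exactly when a
 * scheduled transmission is due, and advances the schedule by one interval
 * so that timing error does not accumulate across frames. */
static bool tdma_poll(tdma_t *t, uint64_t now_us)
{
    if (now_us < t->next_tx_us)
        return false;
    t->next_tx_us += t->interval_us;
    return true;
}
```

Advancing the deadline from the previous deadline, rather than from the time the poll happened to fire, keeps the schedule anchored to the original time base.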

Since our RCB can indicate to SoftWiFi when the
transmission completes, and we know the exact size of
the frame, we can calculate the exact time when the
frame transmits. Table 4 summarizes the results with
various scheduling intervals under a heavy load, where
we copy files on the local disk, download files from
a nearby server, and play back an HD video
simultaneously. In the table, ē represents the average
error and σ represents the standard deviation of the
error. The average error is less than 1µs, which is
sufficient for most wireless protocols. We also list
outliers, which we define as packet transmissions that
occur later than 2µs from the pre-defined schedule.
Previous work has also implemented TDMA MACs on a
commodity WiFi NIC [20], but their software architecture
results in a timing error of nearly 100µs.


8.3 Soft Spectrum Analyzer


It is also easy for Sora to expose all PHY layer
information to applications. One application we have
found useful is a software spectrum analyzer for WiFi. We
have implemented a simple spectrum analyzer that can
graphically display the waveform and modulation points
in a constellation graph, as well as the demodulated
results, as shown in Figure 15. Commercial spectrum
analyzers may have similar functionality and a wider
sensing spectrum band, but they are also more expensive.

Figure 15: Software Spectrum Analyzer built on Sora.


9 Related Work 


In this section we discuss various efforts to implement
software defined radio functionality and platforms.

Traditionally, device drivers have been the primary
software mechanism for changing wireless functionality
on general purpose computing systems. For example,
the MadWiFi drivers for cards with Atheros chipsets [3],
HostAP drivers for Prism chipsets [2], and the rt2x00
drivers for RaLink chipsets [4] are popular driver suites
for experimenting with 802.11. These drivers typically
allow software to control a wide range of 802.11
management tasks and non-time-critical aspects of the MAC
protocol, and allow software to access some device
hardware state and exercise limited control over device
operation (e.g., transmission rate or power). However,
they do not allow changes to fundamental aspects of 802.11
like the MAC packet format or any aspects of PHY.

SoftMAC goes one step further to provide a platform
for implementing customized MAC protocols using
inexpensive commodity 802.11 cards [20]. Based on the
MadWiFi drivers and associated open-source hardware
abstraction layers, SoftMAC takes advantage of features
of the Atheros chipsets to control and disable default
low-level MAC behavior. SoftMAC enables greater
flexibility in implementing non-standard MAC features, but
does not provide a full platform for SDR. With the
separation of functionality between driver software and
hardware firmware on commodity devices, time-critical
tasks and PHY processing remain unchangeable on the
device.

GNU Radio is a popular software toolkit for building
software radios using general purpose computing
platforms [1]. It is derived from an earlier system called
SpectrumWare [22]. GNU Radio consists of a software
library and a hardware platform. Developers implement
software radios by composing modular pre-compiled
components into processing graphs using Python scripts.
The default GNU Radio platform is the Universal Software
Radio Peripheral (USRP), a configurable FPGA radio
board that connects to the host. As with Sora, GNU
Radio performs much of the SDR processing on the host
itself. The current USRP supports USB 2.0, and a new
version, USRP 2.0, upgrades to Gigabit Ethernet. Such
interfaces, though, are not sufficient for high-speed
wireless protocols in wide bandwidth channels. Existing
GNU Radio platforms can only sustain low-speed wireless
communication due to both the hardware constraints
and the software processing [21]. As a consequence,
users must sacrifice radio performance for flexibility.

The WARP hardware platform provides a flexible and 
high-performance software defined radio platform [6]. 
Based on Xilinx FPGAs and PowerPC cores, WARP 
allows full control over the PHY and MAC layers and 
supports customized modulations up to 36 Mbps. A va- 
riety of projects have used WARP to experiment with 
new PHY and MAC features, demonstrating the impact 
a high-performance SDR platform can provide. KUAR 
is another SDR development platform [18]. Similar to 
WARP, KUAR mainly uses Xilinx FPGAs and PowerPC 
cores for signal processing. But it also contains an em- 
bedded PC as the control processor host (CPH), which 
has a 1.4GHz Pentium M processor. Therefore, it allows 
some communication systems to be implemented com- 
pletely in software on CPH. They have demonstrated 
some GNU Radio applications on KUAR. Sora provides 
the same flexibility and performance as hardware-based 
platforms, like WARP, but it also provides a familiar 
and powerful programming environment with software 
portability at a lower cost. 

The SODA architecture represents another point in
the SDR design space [17]. SODA is an application
domain-specific multiprocessor for SDR. It is fully
programmable and targets a range of radio platforms; four
such processors can meet the computational requirements
of 802.11a and W-CDMA. Compared to WARP
and Sora, as a single-chip implementation it is more
appropriate for embedded scenarios. As with WARP,
developers must program to a custom architecture to
implement SDR functionality.


10 Conclusions 


This paper presents Sora, a fully programmable software
radio platform on commodity PC architectures. Sora
combines the performance and fidelity of hardware SDR
platforms with the programmability of GPP-based SDR
platforms. Using the Sora platform, we also present the
design and implementation of SoftWiFi, a software radio
implementation of the 802.11a/b/g protocols. We are
planning and implementing additional software radios,
such as 3GPP LTE (Long Term Evolution), W-CDMA,
and WiMax using the Sora platform. We have started
the implementation of 3GPP LTE in cooperation with
Beijing University of Posts and Telecommunications,
China, and we confirm the programming effort is greatly
reduced with Sora. For example, it has taken one student
only two weeks to develop the transmission half of LTE
PUSCH (Physical Uplink Shared Channel), which can be
a multi-month task on a traditional FPGA platform.

The flexibility provided by Sora makes it a convenient 
platform for experimenting with novel wireless proto- 
cols, such as ANC [16] or PPR [15]. Further, being able 
to utilize multiple cores, Sora can scale to support even 
more complex PHY algorithms, such as MIMO or SIC 
(Successive Interference Cancellation) [23]. 

More broadly, we plan to make Sora available to the 
wireless networking research community. Currently, 
we are collaborating with Xi’an Jiao Tong University, 
China, to design a new MIMO RF module that supports 
eight channels. We are planning moderate production 
of the Sora RCB and RF modules for use by other re- 
searchers. The estimated cost for Sora hardware is about 
$2,000 per set (RCB + one RF front-end). We also plan 
to release the Sora software to the wireless network re- 
search community. Our hope is that Sora can substan- 
tially contribute to the adoption of SDR for wireless net- 
working experimentation and innovation. 
Appendix A: SIMD example for FIR Filter


In this appendix, we show a small example of how to 
use SSE instructions to optimize the implementation of a 
FIR (Finite Impulse Response) filter in Sora. FIR filters 
are widely used in various PHY layers. An n-tap FIR 
filter is defined as 


$$y[t] = \sum_{k=0}^{n-1} c_k \, x[t-k],$$

where x[.] are the input samples, y[.] are the output samples,
and c_k are the filter coefficients. With SIMD instructions,
we can process multiple samples at the same time. For example,
Intel SSE supports a 128-bit packed-vector and each FIR sample
takes 16 bits. Therefore, we can perform m = 8 calculations
simultaneously.

To facilitate SSE processing, the data layout in memory
should be carefully designed. Figure 16 shows the
memory layout of the FIR coefficients. Each row forms
a packed-vector containing m components for SIMD operations.
The coefficient vector of the FIR filter is replicated
in each column in a zig-zag layout. Thus, the total
number of rows is (n + m - 1). There are also n temporary
variables containing the accumulated sum up to
each FIR tap for each sample.


Figure 16: Memory layout of the FIR coefficients.


Figure 17 shows the example code. It takes an array
of input samples, a coefficient array, and outputs the
filtered samples in an output sample buffer. The input
contains two separate sample streams, with the even and
odd indexed samples representing the I and Q samples,
respectively. The coefficient array is arranged similarly
to Figure 16, but with two sets of FIR coefficients for I
and Q samples, respectively.

Each iteration, four I and four Q samples are loaded
into an SSE register. It multiplies the data in each row
and adds the result to the corresponding temporal accumulative
sum variable (lines 59-68). A result is output
when all taps are calculated for the input samples (lines
18-57). When the input sample stream is long, there are
nm samples in the pipeline and m outputs are generated
in each iteration. Note that the output samples may not
be in the same order as the input; some algorithms do
not always require the output to have exactly the same
order as the input. A few shuffle instructions can be
added to place the output samples in original order if
needed.


     1  int FirSSE ( PSAMPLE pSrc,
     2               PSAMPLE pOutput,
     3               int nSize,        // number of complex samples
     4               PSHORT pCoff,     // filter coeffs
     5               int iTaps,        // the highest index of tap (n-1)
     6               PSAMPLE pTempBuf  // for temp value store
     7             )
     8  {
     9    _asm {
    10      mov esi, pSrc;
    11      mov ecx, nSize;
    12      mov ebx, pOutput;
    13    outerloop:
    14      mov edx, pCoff;
    15      mov edi, pTempBuf;
    16
    17      ; // load samples 4-I and 4-Q
    18      movdqa xmm0, [esi];
    19
    20      ; // result_0
    21      movdqa xmm4, xmm0;
    22      pmullw xmm4, [edx];
    23      paddsw xmm4, [edi];
    24      ; // result_1
    25      movdqa xmm5, xmm0;
    26      pmullw xmm5, [edx + 16];
    27      paddsw xmm5, [edi + 16];
    28      ; // result_2
    29      movdqa xmm6, xmm0;
    30      pmullw xmm6, [edx + 32];
    31      paddsw xmm6, [edi + 32];
    32      ; // result_3
    33      movdqa xmm7, xmm0;
    34      pmullw xmm7, [edx + 48];
    35      paddsw xmm7, [edi + 48];
    36
    37      ; // [illegible in source]
    38      ; // perform shuffle and horizontal additions
    39      movdqa xmm1, xmm4;
    40      punpckldq xmm1, xmm6;
    41      punpckhdq xmm4, xmm6;
    42      paddsw xmm4, xmm1;
    43
    44      movdqa xmm1, xmm5;
    45      punpckldq xmm1, xmm7;
    46      punpckhdq xmm5, xmm7;
    47      paddsw xmm5, xmm1;
    48
    49      movdqa xmm1, xmm4;
    50      punpckldq xmm1, xmm5;
    51      punpckhdq xmm4, xmm5;
    52      paddsw xmm4, xmm1;
    53
    54      ; // output
    55      ; // [illegible in source]
    56      ; // adjust the sample orders
    57      movdqa [ebx], xmm4;
    58
    59      ; // update temp buffers
    60      mov eax, iTaps;
    61    innerloop:
    62      movdqa xmm1, xmm0;
    63      pmullw xmm1, [edx + 64];
    64      paddsw xmm1, [edi + 64];
    65      movdqa [edi], xmm1;
    66
    67      add edx, 16;
    68      add edi, 16;
    69      dec eax;
    70      jnz innerloop;
    71
    72      ; // advance to next sample group
    73      add esi, 16;
    74      add ebx, 16;
    75      sub ecx, 4;
    76      jg outerloop;
    77    }
    78  }


Figure 17: Pseudo-code of SSE optimized FIR Filter.
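For comparison with the SSE listing, the FIR definition in this appendix can also be written as a plain scalar loop. The following is a minimal sketch and is not taken from Sora. The names fir_scalar and fir_layout_rows, the 32-bit accumulator, and the convention that samples before the start of the input are zero are assumptions made for illustration. Unlike the SSE code, it does not saturate or truncate intermediate results.

```c
#include <stddef.h>
#include <stdint.h>

/* Scalar reference for an n-tap FIR filter:
 *   y[t] = sum_{k=0}^{n-1} c[k] * x[t-k]
 * Assumptions (not from the paper): samples before x[0] are treated as
 * zero, and products are accumulated in 32 bits without saturation. */
void fir_scalar(const int16_t *x, size_t n_samples,
                const int16_t *c, size_t n_taps,
                int32_t *y)
{
    for (size_t t = 0; t < n_samples; t++) {
        int32_t acc = 0;
        /* k <= t skips taps that would read before the first sample. */
        for (size_t k = 0; k < n_taps && k <= t; k++) {
            acc += (int32_t)c[k] * (int32_t)x[t - k];
        }
        y[t] = acc;
    }
}

/* Number of 16-bit lanes in one 128-bit SSE register (m in the text). */
enum { FIR_SSE_LANES = 128 / 16 };

/* Rows in the replicated coefficient layout described in the text:
 * n + m - 1. */
size_t fir_layout_rows(size_t n_taps)
{
    return n_taps + FIR_SSE_LANES - 1;
}
```

Feeding a unit impulse through fir_scalar returns the coefficients themselves, which gives a quick check for any vectorized variant.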
Abstract


Over the past few years a range of new Media Access 
Control (MAC) protocols have been proposed for wire- 
less networks. This research has been driven by the 
observation that a single one-size-fits-all MAC protocol 
cannot meet the needs of diverse wireless deployments 
and applications. Unfortunately, most MAC functional- 
ity has traditionally been implemented on the wireless 
card for performance reasons, thus limiting the opportunities
for MAC customization. Software-defined radios
(SDRs) promise unprecedented flexibility, but their ar- 
chitecture has proven to be a challenge for MAC proto- 
cols. 

In this paper, we identify a minimum set of core MAC 
functions that must be implemented close to the radio 
in a high-latency SDR architecture to enable high per- 
formance and efficient MAC implementations. These 
functions include: precise scheduling in time, carrier 
sense, backoff, dependent packets, packet recognition, 
fine-grained radio control, and access to physical layer 
information. While we focus on an architecture where 
the bus latency exceeds common MAC interaction times 
(tens to hundreds of microseconds), other SDR architec- 
tures with lower latencies can also benefit from imple- 
menting a subset of these functions closer to the radio. 
We also define an API applicable to all SDR architectures 
that allows the host to control these functions, providing 
the necessary flexibility to implement a diverse range of 
MAC protocols. We show the effectiveness of our split- 
functionality approach through an implementation on the 
GNU Radio and USRP platforms. Our evaluation, based
on microbenchmarks and end-to-end network measurements,
shows that our design can simultaneously achieve
high flexibility and high performance. 


1 Introduction 


Over the past few years, a range of new Media Access 
Control (MAC) protocols have been proposed for use in 
wireless networks. Much of this increased activity has 
been driven by the observation that a single one-size-fits-all
MAC protocol cannot meet the needs of diverse
wireless deployments and applications and, thus, MAC 
protocols need to be specialized (e.g. for use on long- 
distance links, mesh networks). Unfortunately, the devel- 
opment and deployment of new MAC designs has been 
slow due to the limited programmability of traditional 
wireless network interface hardware. The reason is that 
key MAC functions are implemented on the network in- 
terface card (NIC) for performance reasons, which often 
uses proprietary software and custom hardware, making 
the MAC hard, if not impossible, to modify.

Software-defined radios (SDRs) have been proposed 
as an attractive alternative. SDRs provide simple hard- 
ware that translates signals between the RF and the digi- 
tal domains. SDRs implement most of the network inter- 
face functionality (e.g., the physical layer and link layer) 
in software and, as a result, they make it feasible for 
developers to modify this functionality. SDR architec- 
tures [19, 6, 17, 20, 9] typically distribute processing of 
the digitized signals across several processing units — in- 
cluding FPGAs and CPUs located on the SDR device, 
and the CPU of the host. The platforms differ in the pre- 
cise nature of the processing units that are provided, how 
those units are connected, and how computation is dis- 
tributed across them. 

Unfortunately, the high degree of flexibility offered 
by SDRs does not automatically lead to flexibility in the 
MAC implementation. The reason is that, in the SDR architecture
we are addressing, the use of multiple heterogeneous
processing units with interconnecting buses introduces
large delays and jitter into the processing path of
packets. Processing, queuing, and bus transfer delays can
easily add up to hundreds of microseconds [14]. The delay
limits how quickly the MAC can respond to incoming packets
or changes in channel conditions, and the jitter prevents
precise control over the timing of packet transmissions.
These restrictions severely
reduce the performance of many MAC protocols. 

NSDI ’09: 6th USENIX Symposium on Networked Systems Design and Implementation


Figure 1: Generic SDR Architecture
[Figure: transmit and receive paths from the RF frontend (2.4GHz, IF)
through the radio hardware and FPGA, across USB, to the host kernel and
CPU. Latency labels: 15.6ns (1/clock), negligible, 25us, and 25ms (1500 bytes).]


This paper presents a set of techniques that makes it
possible to implement diverse, high-performance MAC
protocols that are easy to modify and customize from the
host. The key idea is a novel way of splitting core MAC
functionality between the host processing unit and
processing units on the hardware (e.g., FPGA). The paper
makes the following contributions:


• We identify a set of core MAC functions that must
be implemented close to the radio for performance
and efficiency reasons.

• We define a split-functionality architecture that allows
the functions to be implemented near the radio
hardware, while maintaining control on the host
CPU through an API.

• We present an implementation of our architecture
using the GNU Radio [6] and USRP [17] SDR platform.
We also use our implementation to characterize
the performance-flexibility tradeoffs for key
MAC features. For example, our results show
three orders of magnitude greater precision for the
scheduling of packets and carrier sense, along with
a high level of accuracy in fast packet detection.

• Finally, we use our implementation for an end-to-end
evaluation of the split-functionality architecture.
We show how the system can support diverse
high-performance MAC implementations by implementing
802.11-like and Bluetooth-like protocols
for experimentation over the air.


The rest of the paper is organized as follows. We dis- 
cuss current radio architecture and its impact on MAC 
protocol development in Section 2. In Sections 3 and 
4, we explore the core MAC requirements and introduce 
our split-functionality architecture. Section 5 provides 
details for each component implementation with evalu- 
ation results. Finally, we present end-to-end evaluation 
results, related work, and a summary of our results in 
Sections 6 through 8. 


2 MAC Implementation Choices 


A number of different software-defined radio architec- 
tures have been developed. One common architecture 
is shown in Figure 1. The frontend is responsible for
converting the signal between the RF domain and an
intermediate frequency, and the A/D and D/A components
convert the signal between the analog and the digital
domain. Physical and higher layer processing of the
digitized signal is executed on one or more processing
units. Typically, there is at least an FPGA or DSP close 
to the frontend. The frontend, D/A, A/D, and FPGA are 
usually placed on a network card that is connected to the 
host CPU by a standard bus (e.g., USB). 


The distribution of functionality across the processing 
units significantly impacts the radio’s performance, flex- 
ibility, and ease of reprogramming. To achieve a high 
level of flexibility and reprogrammability, the majority of
processing (i.e., modulation) can be placed on the host
CPU where the functionality is easy to modify. We refer
to this architecture as host-PHY. This architecture is
exemplified by GNU Radio [6] and the USRP [17], which
place the majority of functionality in userspace, as shown
in Figure 1. For greater performance, processing can be
implemented in the radio hardware on the FPGA or DSP.
We refer to this architecture as NIC-PHY. The WARP
platform [20] implements this architecture, placing the
PHY and MAC layers on the radio hardware for performance
reasons. It is fairly straightforward, however, to
parameterize PHY layers (e.g., to control the frequency
band and coding and modulation options). Thus, it is
possible to control many aspects of the PHY layer from the
host, no matter where it is implemented.


Unfortunately, MAC protocols are less structured and 
SDRs have fallen short in providing high-performance 
flexible MAC implementations. The MAC is either implemented
near the radio hardware for performance, or
near the host for flexibility. We propose a novel split of 
MAC functionality across the processing units in a host- 
PHY architecture such that we can achieve a high level 
of performance, while maintaining flexibility at both the 
MAC and PHY layers. This is especially significant in 
a host-PHY architecture, which has been considered in- 
capable of supporting even core MAC protocol functions 
(e.g., carrier sense) due to the large processing delays in- 
herent to the architecture [14, 18]. In addition, our design 
can enable many cross-layer optimizations, such as those 
proposed between the MAC and PHY layers [5, 8, 7]. 
Such optimizations have used the host-PHY architecture 
for easy PHY modifications, but given the lack of MAC 
support, they typically “fake” the MAC layer (e.g., by
combining the SDR with a commodity 802.11 NIC to do
the MAC processing [5]) or omit it altogether [7, 8].
Although our work focuses on a host-PHY architecture, 
several of the components we will present can be applied 
to a NIC-PHY architecture. 


In the next section, we present delay and jitter measurements
for the host-PHY architecture. Delay and jitter are the
major limiting factors on the performance of MAC
implementations, so the measurements are important in
understanding the proper split of MAC functionality across the
heterogeneous processing units of an SDR.
                             Avg   SDev   Min    Max
User -> Kernel (µs)           24     10    22    213
Kernel -> User (µs)           27     89    13   7000
4096 Kernel <-> FPGA (µs)    291     62   204    360
512 Kernel <-> FPGA (µs)     148     a2    90    193
GNU Radio <-> FPGA (µs)      612    789   289   9000


Table 1: Kernel level delay measurements.


2.1 Delay Measurements 


Schmid et al. [14] present delay measurements for SDRs
and their impact on MAC functionality in a host-PHY 
architecture. However, they focus on user-level mea- 
surements, largely ignoring precise measurement of de- 
lays between the kernel and userspace, and kernel and 
the radio hardware. Such measurements are important, 
since they can provide insight into whether implementing 
MAC functions in the kernel is sufficient to overcome the 
performance problems associated with user level implementations.
To obtain precise user and kernel-level measurements,
we modified the Linux kernel’s USB Request
Block (URB) and USB Device Filesystem URB
(USBDEVFS_URB) to include nanosecond precision timestamps
taken at various times in the transmission and receive
process. All user level timestamps are taken in user
space right before or after a URB is submitted (write) or 
returned (read). At the kernel level, the measurement is 
taken at the last point in the kernel’s USB driver before 
the DMA write request is generated, or after a DMA read 
request interrupts the driver. This is as close to the bus 
transfer timing as possible. 
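The user-level half of such a measurement can be sketched as follows. This is an illustration only and not the instrumentation used in the paper: the helper names are hypothetical, and the use of POSIX clock_gettime with CLOCK_MONOTONIC is an assumption about how a user-space timestamp might be taken right before a URB is submitted or right after it is returned.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 199309L  /* expose clock_gettime under strict C modes */
#endif
#include <stdint.h>
#include <time.h>

/* Difference end - start in nanoseconds. */
int64_t timespec_diff_ns(struct timespec start, struct timespec end)
{
    return (int64_t)(end.tv_sec - start.tv_sec) * 1000000000LL
         + (int64_t)(end.tv_nsec - start.tv_nsec);
}

/* Read a monotonic timestamp; returns 0 on success, -1 on failure.
 * Hypothetical usage: call once right before submitting a write and once
 * right after a read returns, then pass both to timespec_diff_ns(). */
int stamp_now(struct timespec *ts)
{
    return clock_gettime(CLOCK_MONOTONIC, ts);
}
```

A monotonic clock is chosen in this sketch so that wall-clock adjustments during a run cannot produce negative intervals.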

We measured the round trip time between GNU Ra- 
dio (in user space) and the FPGA using a ping command 
on a control channel that we implement (Section 4.2). 
Using the measurements described above, we are also 
able to identify the sources of the delay by calculating 
the user to kernel space delay, kernel to user space de- 
lay, and round trip time between the kernel and FPGA 
based on ping. We ran the user process at the highest pri- 
ority to minimize scheduling delay. We used the default 
4096 byte USB transfer block size for all experiments, 
and then performed an additional kernel to FPGA RTT experiment
using a 512 byte transfer block size, the minimum
possible, in an attempt to minimize queuing delay.

The results presented in Table 1 are averaged over
1000 experiments. Focusing on the average times, we
see the cost of a GNU Radio ping is dominated by the
kernel-FPGA roundtrip time (291 out of 612 µs). The
user-kernel and kernel-user times are relatively modest
(24 and 27 µs). The remaining time (270 µs) is spent in
the GNU Radio chain. The high latency of the kernel- 
FPGA roundtrip time is somewhat surprising, given that 
the effective measured rate of the USB with the USRP is 
32MB/s. The difference between the latencies for 4KB
and 512B sheds some light on this. The difference in latency
is only a factor of two, suggesting that the setup
cost for transfers contributes significantly to the delay.
The kernel-FPGA time also includes the time it takes for
the data to pass through the USRP USB FX2 controller
buffers, and to be copied into the FPGA for parsing.
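The arithmetic behind this breakdown can be made explicit. The sketch below uses only the average values reported in Table 1. The function names are hypothetical, and the linear model rtt(size) = setup + per_byte * size is an assumption introduced here to estimate the fixed per-transfer cost; the paper itself only observes that the two latencies differ by a factor of about two.

```c
/* Average delays from Table 1, in microseconds. */
enum {
    USER_TO_KERNEL_US   = 24,
    KERNEL_TO_USER_US   = 27,
    KERNEL_FPGA_4096_US = 291,  /* round trip, 4096-byte blocks */
    KERNEL_FPGA_512_US  = 148,  /* round trip, 512-byte blocks  */
    GNURADIO_FPGA_US    = 612   /* round trip from user space   */
};

/* Time not accounted for by the kernel crossings and the bus round trip,
 * i.e., the portion attributed to the GNU Radio chain. */
int gnuradio_chain_us(void)
{
    return GNURADIO_FPGA_US - KERNEL_FPGA_4096_US
         - USER_TO_KERNEL_US - KERNEL_TO_USER_US;
}

/* Fixed per-transfer cost under the assumed linear model
 *   rtt(size) = setup + per_byte * size
 * fitted through the two measured block sizes. */
double usb_setup_cost_us(void)
{
    double per_byte = (double)(KERNEL_FPGA_4096_US - KERNEL_FPGA_512_US)
                    / (4096.0 - 512.0);
    return KERNEL_FPGA_512_US - per_byte * 512.0;
}
```

Under that assumed model the fixed cost comes to roughly 128 µs, a large share of both round-trip times, which is consistent with the observation that setup cost contributes significantly to the delay.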

The standard deviations and the min/max values paint 
a different picture. The user-to-kernel and kernel-FPGA 
times fall in a fairly narrow range, so they only contribute 
a limited amount of jitter. The kernel-to-user times,
however, have a very high standard deviation, which results
in a high standard deviation for the GNU Radio ping
delays. This is clearly the result of process scheduling.


2.2 MAC Design Space 


As discussed briefly in Section 2, the processing units 
in the above SDR architecture have very different properties.
Focusing on Figure 1, the host CPU is easy to
program and is readily accessible to users and developers.
However, the path between the host CPU and the
radio front end has both high delay and jitter, as shown
by the measurements presented in Section 2.1. The round
trip time between the device driver on the host and the
FPGA is about 300 µs for 4KB of data, with relatively
modest jitter. The roundtrip from GNU Radio is about
double, but with significantly more jitter. As a result, a
host-based MAC protocol (be it in user space or in the
kernel) will not be able to precisely control packet timing,
or implement small, precise inter-frame spacings,
which will hurt the performance of many MAC protocols.
We conclude that time-critical radio or MAC functions
should not be placed on the host CPU.

Processing close to the radio performed by an FPGA
or CPU on the NIC has the opposite properties. It has a 
low latency path to the frontend (see USRP latencies in 
Figure 1), making it attractive for delay sensitive func- 
tions. Unfortunately, code running on the radio hardware 
is much harder to change because it is often hardware- 
specific and requires a more complex development envi- 
ronment. Moreover, history shows that vendors do not 
provide open access to their NICs, even if they are pro- 
grammable. Access to the processors on the NIC is re- 
stricted to its manufacturer and possibly large customers 
who can, under license, customize the NIC code. This 
is of course not a problem for research groups using 
research platforms, which is why many researchers are 
moving to software radios, but it is an important consid- 
eration for widespread deployment. We conclude that in 
order to be widely applicable, the control of flexible MAC 
implementations should reside on the host.

Interestingly enough, the SDR NIC architecture in Figure 1
is not unlike the architecture of traditional NICs
(e.g., 802.11 cards). Today’s commodity NICs use ana- 
log hardware to perform physical layer processing, but 
they typically also have a CPU, FPGA, or custom proces- 
sor. These commodity devices exhibit the same tradeoffs 
we identified above for software radios: the delay be- 
tween the processing on the host and the (analog) fron- 
tend is substantially higher and less predictable than be- 
tween the NIC processor and the front end. 

Experience with commercial 802.11 cards supports 
the conclusions we highlighted above. First, time sen- 
sitive MAC functions such as sending ACKs are always 
performed on the NIC, and only functions that are not de- 
lay sensitive such as access point association are handled 
by the host processor. Moreover, although most of the 
MAC functionality on the NIC is implemented in soft- 
ware, it can only be modified by a small number of ven- 
dors (i.e. in practice the NIC is a black box). Researchers 
have had some success in using commodity cards for 
MAC research by moving specific MAC functions to the 
host [13, 16, 10, 15], but the results are often unsatis- 
factory. The host can only take control over certain func- 
tions (e.g. interframe spacings must be longer than 60 
microseconds), precision is limited (e.g. cannot elimi- 
nate all effects of jitter), and the host implementation is 
inefficient (as a result of polling) and is susceptible to 
host loads. 

The different properties of the host and NIC processing
units mean that the placement of MAC functionality
will fundamentally affect four key MAC performance
metrics: network performance, flexibility in
MAC implementation, runtime control, and ease of
development. Unfortunately, as discussed above, these
performance goals are in conflict with each other and 
achieving the highest level for each is not possible. In 
this paper, we present a split-functionality architecture 
that implements key MAC functions on the radio hard- 
ware, but provides full control to the host. This allows 
us to simultaneously score very high on all four metrics, 
and it also allows developers and users to make tradeoffs 
across the metrics. While developers will always have to 
make tradeoffs, the negatives associated with specific de- 
sign choices are significantly reduced in our design. Note 
that this does not imply that our design can support any 
arbitrary or even all existing MAC designs. However, we 
believe that it is capable of supporting most of the critical 
features of modern MAC designs. 

The focus of the paper is on SDR platforms be- 
cause they provide maximal flexibility in key research 
areas such as cross-layer MAC and PHY optimization 
(e.g., [5, 7, 8]). Our evaluation is based on a platform that
uses the host-PHY architecture, but this choice is not critical. Even
in NIC-PHY architectures that have good support for the
MAC on the NIC (e.g., in the form of a general-purpose
CPU), it is important to maintain control over the MAC 
and PHY on the host to ensure easy customization. As 
a result, the techniques we propose can be useful across 
the entire spectrum of NIC designs. 


3 Core MAC Functions 


An ideal wireless protocol platform should support the 
implementation of well-known MAC protocols as well as 
novel MAC research designs. A study of current wireless 
protocols, including WiFi (both Distributed and Point 
Coordination Function), Zigbee, Bluetooth, and various 
research protocols shows that they are based on a com- 
mon, core set of techniques such as contention-based ac- 
cess (CSMA), TDMA, CDMA, and polling. In this sec- 
tion, we identify key core functions that a platform must 
implement efficiently in order to support a wide range of 
MAC protocols. 

Precise Scheduling in Time: TDMA-based protocols 
require precise scheduling to ensure that transmissions 
occur during time slots. Imprecise timing can be tol- 
erated by using long guard periods; however, this de- 
grades performance. Surprisingly, modern contention- 
based protocols also require precise scheduling to imple- 
ment inter-frame spacing (i.e. DIFS, SIFS, PIFS), con- 
tention windows, back-off periods, etc. 
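The cost of tolerating imprecise timing with guard periods can be illustrated with a small calculation. This is a sketch under stated assumptions, not a description of any particular protocol: the function names are hypothetical, and it assumes one guard period per slot that is at least as long as the worst-case scheduling error.

```c
#include <stdint.h>

/* Usable transmit time in one TDMA slot when a guard period equal to the
 * worst-case scheduling error is reserved (assumption: one guard period
 * per slot). Returns 0 if the guard consumes the whole slot. */
uint32_t tdma_usable_us(uint32_t slot_us, uint32_t timing_error_us)
{
    return timing_error_us >= slot_us ? 0 : slot_us - timing_error_us;
}

/* Start time of slot `index` in a frame that began at `frame_start_us`. */
uint64_t tdma_slot_start_us(uint64_t frame_start_us, uint32_t slot_us,
                            uint32_t index)
{
    return frame_start_us + (uint64_t)slot_us * index;
}
```

For a 625 µs slot, a 1 µs timing error leaves 624 µs usable, while a 300 µs error (on the order of the host-to-FPGA round trip measured in Section 2.1) leaves only 325 µs.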

Carrier Sense: Contention-based protocols often use 
carrier sense to detect other transmissions. Carrier
sense may use simple power detection (e.g., using sig- 
nal strength) or may use actual bit decoding. Network 
interfaces need to transmit shortly after the channel is 
detected to be idle. Additional delay increases both the 
frequency of collision and also the minimum packet size 
required by the network. 
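Simple power detection, the first of the two carrier sense options mentioned above, can be sketched as an energy threshold over a window of samples. The function name, the interleaved I/Q sample layout, and the threshold units are assumptions chosen for illustration and are not taken from the paper's implementation.

```c
#include <stddef.h>
#include <stdint.h>

/* Energy-detection carrier sense over interleaved 16-bit I/Q samples
 * (iq[0] = I0, iq[1] = Q0, iq[2] = I1, ...). Returns 1 if the mean power
 * I^2 + Q^2 over n_complex samples exceeds `threshold`, 0 otherwise. */
int carrier_busy(const int16_t *iq, size_t n_complex, uint64_t threshold)
{
    uint64_t energy = 0;

    if (n_complex == 0) {
        return 0;  /* no samples: report idle */
    }
    for (size_t i = 0; i < n_complex; i++) {
        int64_t re = iq[2 * i];
        int64_t im = iq[2 * i + 1];
        energy += (uint64_t)(re * re + im * im);
    }
    return energy / n_complex > threshold;
}
```

A shorter window reacts faster to a newly busy channel but is more sensitive to noise; that tradeoff is left to the caller in this sketch.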

Backoff: When a transmission fails in a contention- 
based protocol, a backoff mechanism is used to resched- 
ule the transmission under the assumption that the 
loss was caused by a collision. Backoff is related to 
precise scheduling, but focuses more closely on fast- 
rescheduling of a transmission without the full packet 
transmission process (e.g., modulation). 
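The text does not prescribe a particular backoff policy. As one common instance, the sketch below assumes binary exponential backoff in the style of 802.11, where the contention window grows from values such as 15 to 31 to 63 after each failure. The function names and the caller-supplied random value are hypothetical.

```c
#include <stdint.h>

/* Double the contention window after a failed transmission, capped at
 * cw_max. Windows have the form 2^k - 1 (15, 31, 63, ...).
 * Assumes cw is small enough that 2 * cw + 1 does not overflow. */
uint32_t backoff_next_cw(uint32_t cw, uint32_t cw_max)
{
    uint32_t next = 2 * cw + 1;
    return next > cw_max ? cw_max : next;
}

/* Map a caller-supplied random number to a backoff delay in microseconds:
 * a slot count drawn from [0, cw] multiplied by the slot duration. */
uint32_t backoff_delay_us(uint32_t cw, uint32_t random_value, uint32_t slot_us)
{
    return (random_value % (cw + 1)) * slot_us;
}
```

Because only the delay changes between attempts, a retransmission scheduled this way does not need to repeat the full packet preparation, which is the point made in the text about fast rescheduling.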

Fast Packet Recognition: Many MAC performance 
optimizations could use the ability to quickly detect an 
incoming packet and identify that it is relevant to the lo- 
cal node in a timely and accurate manner. For example, 
detecting and identifying an incoming packet before the 
demodulation procedure can reduce resource use on the 
processing units and on the bus. 

Dependent Packets: Dependent packets are explicit 
responses to received packets. A typical example is con- 
trol packets that are associated with data packets, for 
example for error control (e.g., ACKs) or for improved
channel access (e.g., RTS/CTS). Network interfaces need
to generate these packets quickly and transmit them with 
precise time scheduling relative to the previous packet. 

Fine-grained Radio Control: Frequency-hopping 
spread spectrum protocols such as Bluetooth and the recently
proposed MAXchop algorithm [11] require fine-grained
radio control to rapidly change channels according
to a pseudo-random sequence. Similarly, recent designs
[1] for minimizing interference require the ability
to control transmission power on a per-packet basis. 
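Channel hopping according to a shared pseudo-random sequence can be sketched with any deterministic generator that both endpoints seed identically. The code below is an illustration only: it uses a simple linear congruential generator, which is neither the Bluetooth hop selection algorithm nor MAXchop, and the function names are hypothetical.

```c
#include <stdint.h>

/* Advance a 32-bit linear congruential generator. The multiplier and
 * increment are widely used LCG constants; any full-period pair works. */
uint32_t hop_next_state(uint32_t state)
{
    return state * 1664525u + 1013904223u;
}

/* Channel index in [0, n_channels) for the given generator state.
 * The high bits are used because the low bits of an LCG have short
 * periods. n_channels must be nonzero. */
uint32_t hop_channel(uint32_t state, uint32_t n_channels)
{
    return (state >> 16) % n_channels;
}
```

Two nodes that start from the same seed and advance the state once per slot select the same channel in every slot, so the radio hardware only needs the per-slot channel index and a precise slot boundary.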

Access to physical layer information: Many MAC 
protocol optimizations could benefit from access to 
radio-level packet information. Examples include using 
a received signal strength indicator (RSSI) to improve 
access point handoff decisions and using information on 
the confidence of each decoded bit to implement partial 
packet recovery [7]. 


3.1 Implications 


While it is difficult to argue that this (or any) list of core 
functions is the correct one and is complete, we believe 
that it is sufficient to implement a broad range of inter- 
esting MAC protocols. To provide some degree of confi- 
dence in this statement, we describe our implementation 
of an 802.11-like CSMA protocol and a Bluetooth-like 
TDMA protocol using our framework in Section 6. As 
such, this is a reasonable first “toolbox” that MAC pro- 
tocol developers can extend over time. 


4 Split Functionality Architecture 


As discussed in Section 2, implementing flexible high- 
performance MAC protocols is challenging because the 
high delays and jitter between the host CPU and frontend 
affect the performance of the core MAC functions described
in the previous section. For example, most protocols
need either precise scheduling in time or dependent
packets. However, the delays inherent in a host MAC im- 
plementation in the given SDR architecture would make 
these functions inefficient or ineffective. In this section, 
we first review the requirements associated with the core 
MAC functions identified above, and we then present an 
architecture that allows us to support high performance 
MACs while maintaining host control. 


4.1 Core Requirements 


Implementing the core MAC functions from Section 3 
raises three challenges. 

Bus delay: The delay introduced by transmission of 
data over the bus can be constant and predictable, depending
on the technology. A constant delay is relatively
easy to accommodate in supporting precision scheduling,
as discussed in Section 5.1. However, the bus delay
does impact the performance of carrier sense, dependent 
packets, and fast packet recognition. The effect of bus 
latency on performance for SDR NICs is discussed in 
previous work [14]. 

Queuing delay: The delay introduced by queues may 
be smaller than the bus transmission delay but has signif- 
icant jitter, which makes precision scheduling difficult, 
if not impossible. The jitter can modify the inter-packet 
spacing through compression or dispersion as the data is 
processed in the host and at the ends of the bus. In Section
5.1.2, we present measurements that show that this
compression can be so significant in the given architecture
that spacing transmissions by under 1ms cannot be
achieved reliably using host-CPU based scheduling.

Stream-based architecture of SDRs: The frontend 
operates on streams of samples, which can make fine-grained
radio control and access to physical layer information
from the host ineffective. The reason is that it
adds complexity to the interaction between a MAC layer 
executing on a host CPU (or NIC CPU) and the radio 
frontend since it is difficult to associate control informa- 
tion or radio information with particular groups of sam- 
ples (e.g., those belonging to a packet). This problem 
consists of two components: (1) how to propagate in- 
formation within the software environment that performs 
physical and MAC layer processing, and (2) how to prop- 
agate the information between the host and the frontend, 
across the bus and SDR hardware. The first issue is
being addressed in the GNU Radio design with the in- 
troduction of m-blocks [2], which is briefly discussed in 
Section 7, but we must address the second issue. 


4.2 Overcoming the Limitations 


We now present an architecture that overcomes the above 
limitations. The goal is to allow as much of the pro- 
tocol to execute on the host as possible to achieve the 
flexibility and ease of development goals, both of which 
are important to a wireless platform for protocol devel- 
opment, as identified in Section 2. However, we must 
ensure that the high latency and jitter between the host 
and radio frontend does not result in poor performance 
and limited control, the other two criteria in Section 2. 
This is done by introducing two architectural features, 
per-block meta-data and a control channel, shown in 
Figure 2. The novelty is not in the two new architectural 
features, but in how we use them to implement the core 
MAC functions (Section 3) in such a way that we main- 
tain flexibility, while increasing performance (Section 5). 
We first discuss both features in more detail. 

Figure 2: Split SDR architecture.


Per block meta-data: Enabling the association of information
with a packet is crucial to the support of nearly
all of the core requirements in Section 3. Each packet is
modulated into blocks of samples, for which we intro- 
duce per block meta-data. The meta-data stored in the 
header includes a timestamp (inbound and outbound), a 
channel flag (data/control), a payload length, and single 
bit flags to mark events such as overrun, underrun, or to 
request specific functions that we implement on the ra- 
dio hardware. We limit the scope of the meta-data to the 
minimum needed to support the core requirements, thus 
minimizing the overhead on the bus. 
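The per block meta-data header can be sketched as a small fixed-size structure. The field widths, byte order, and flag bit positions below are assumptions made for illustration only; the text names the fields (timestamp, channel flag, payload length, single-bit event flags) but not their layout.

```python
import struct
from dataclasses import dataclass

# Hypothetical layout (not taken from the paper): 1-byte channel,
# 1-byte flag field, 2-byte payload length, 4-byte timestamp.
HEADER = struct.Struct("<BBHI")

CHAN_DATA = 0        # assumed encoding of the data/control channel flag
CHAN_CONTROL = 1
FLAG_OVERRUN = 1 << 0   # assumed bit positions for the event flags
FLAG_UNDERRUN = 1 << 1


@dataclass
class BlockHeader:
    channel: int
    flags: int
    payload_len: int
    timestamp: int

    def pack(self) -> bytes:
        """Serialize the header for transfer over the bus."""
        return HEADER.pack(self.channel, self.flags,
                           self.payload_len, self.timestamp)

    @classmethod
    def unpack(cls, raw: bytes) -> "BlockHeader":
        """Parse the header from the first HEADER.size bytes of a block."""
        return cls(*HEADER.unpack(raw[:HEADER.size]))
```

Keeping the header to a few bytes reflects the stated goal of limiting the meta-data to the minimum needed, so that the per-block overhead on the bus stays small.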

Control Channel: The control channel allows us 
to implement a rich API between the host and radio 
hardware and allows for less frequent information to be 
passed. It consists of control blocks that are interleaved 
with the data blocks over the same bus. Control blocks 
carry the same meta-data header as data blocks but have 
the channel field in the header set to CONTROL. The 
control block payload contains one or more command 
subblocks. Each subblock specifies the command type, 
the length of the subblock, and information relevant to 
the specific command (e.g., a register number). Exam- 
ples of commands include: reading or writing configu- 
ration registers on the SDR device, changing the carrier 
frequency, and setting the signal sampling rate. 
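A control block payload made of command subblocks might be built and parsed as follows. This is a minimal sketch: the one-byte type and length fields and the command codes are hypothetical, chosen only to show the type/length/body structure described above.

```python
import struct

# Hypothetical command codes; the paper does not list numeric values.
CMD_WRITE_REG = 1
CMD_SET_FREQ = 2


def make_subblock(cmd_type: int, body: bytes) -> bytes:
    """Encode one subblock: command type, total subblock length, body."""
    return struct.pack("<BB", cmd_type, 2 + len(body)) + body


def parse_subblocks(payload: bytes):
    """Split a control block payload into (command type, body) pairs."""
    commands = []
    offset = 0
    while offset < len(payload):
        cmd_type, length = struct.unpack_from("<BB", payload, offset)
        if length < 2 or offset + length > len(payload):
            raise ValueError("malformed subblock")
        commands.append((cmd_type, payload[offset + 2:offset + length]))
        offset += length
    return commands
```

Because each subblock carries its own length, several commands can share one control block, which matches the "one or more command subblocks" wording in the text.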

With these two features, we can effectively partition 
the core MAC functions into a part that runs on the radio 
hardware close to the radio frontend, and a control part 
that runs on the host. Of course, meta-data and control 
channels are used in many contexts. The contribution lies 
in how we use them to partition the core MAC functions, 
which is the focus of the next section. 


5 Core Component Design and 
Evaluation 


We now examine how the split-functionality approach 
can be used to implement the core functions described 
in Section 3. We also evaluate the performance of the 
implementation of each core function. We focus our dis- 
cussion on the GNU Radio and USRP platform. 
5.1 Precise Scheduling in Time


Precision scheduling needs to be implemented close to 
the radio to achieve the fine-grained timing required for 
TDMA, spread spectrum, and contention based proto- 
cols. This is especially important when a large amount of 
jitter exists in the system from multiple stages of queuing 
and process scheduling, explored in Section 2.1. 

For nodes to synchronize to the time of a global ref- 
erence point, such as a beacon transmission for synchro- 
nization to the start of a round in a TDMA protocol, the 
nodes need to accurately estimate the reference point. 
Jitter at the transmitter can cause the actual transmission
of the beacon to vary from its target time by $\delta_t$, the
maximum transmission jitter. Moreover, the estimated time of
the beacon transmission as a global reference point will
vary by $\delta_r$, the maximum reception jitter. The maximum
error is therefore $\delta_t + \delta_r$, which defines the minimum
guard time needed by a TDMA protocol. By minimizing
$\delta_t$ and $\delta_r$, we increase channel capacity.


5.1.1 Precision Scheduling Design 


Our delay measurements in Section 2.1 suggest that 
much of the delay jitter is created near the host. There- 
fore, the triggering mechanism for packet transmissions 
should reside beyond the introduction of the jitter. Like- 
wise, to obtain an accurate local time at which a recep- 
tion occurs, the time should be recorded prior to the in- 
troduction of the jitter on the RX path. To enable preci- 
sion scheduling, we use a free running clock on the radio 
hardware to coordinate transmission/reception times as 
follows. 

Transmit: To reduce the transmission jitter ($\delta_t$), we
insert a timestamp on all sample blocks sent from the 
host to the radio hardware. When the radio hardware 
receives the sample block, it waits until the local clock 
is equal to the timestamp value before transmitting the 
samples. This allows for timing compression or disper- 
sion of data in the system with no effect on the preci- 
sion scheduling of the transmission. The host must en- 
sure the transmission reaches the radio hardware before 
the timestamp is equal to the hardware clock, else the 
transmission is discarded. The host is notified on failure, 
which can be treated as notification to schedule transmis- 
sions earlier. To support traditional best-effort streaming, 
we use a special timestamp value, called NOW, to transmit
the block immediately.
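The release rule applied by the radio hardware can be summarized in a few lines. This is a sketch under stated assumptions: the NOW sentinel value is hypothetical, and clock wrap-around is ignored for brevity.

```python
NOW = 0xFFFFFFFF  # hypothetical sentinel; the paper does not give the value


def release_decision(timestamp: int, clock: int) -> str:
    """Decide what to do with a sample block that has reached the hardware.

    Returns "transmit", "wait", or "discard". Clock wrap-around is not
    handled in this sketch.
    """
    if timestamp == NOW:
        return "transmit"      # best-effort streaming
    if clock < timestamp:
        return "wait"          # hold the block until the clock catches up
    if clock == timestamp:
        return "transmit"      # exact scheduled instant
    return "discard"           # block arrived late; the host is notified
```

Since the decision depends only on the hardware clock, any compression or dispersion of blocks on the bus has no effect as long as the block arrives before its timestamp.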

In practice, the samples for a packet will be frag- 
mented across multiple blocks. To make sure that a sin- 
gle packet’s transmission is continuous and that if the 
packet is dropped all fragments are dropped, we imple- 
ment start of packet and end of packet flags in the block 
headers. The first block carrying the packet will have the 
start of packet flag set and the timestamp for transmis- 
Figure 3: Evaluation setup using 3 USRPs.


sion. All remaining blocks carry a timestamp value of 
NOW to ensure continuous transmission. The hardware 
detects the last fragment using the end of packet flag, and 
can also report underruns to the host by detecting a gap 
between fragments. 
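Fragmenting one packet's samples into blocks could look like the sketch below. The flag bit positions and the NOW sentinel are assumptions for illustration; only the rule itself (first block carries the timestamp and the start of packet flag, later blocks carry NOW, last block carries the end of packet flag) comes from the text.

```python
NOW = 0xFFFFFFFF   # hypothetical sentinel value
FLAG_SOP = 1 << 0  # assumed bit for "start of packet"
FLAG_EOP = 1 << 1  # assumed bit for "end of packet"


def fragment(samples, block_size, tx_time):
    """Split a packet's samples into (timestamp, flags, chunk) blocks."""
    blocks = []
    for start in range(0, len(samples), block_size):
        chunk = samples[start:start + block_size]
        flags = 0
        if start == 0:
            flags |= FLAG_SOP
        if start + block_size >= len(samples):
            flags |= FLAG_EOP
        # Only the first fragment is scheduled; the rest follow back to back.
        timestamp = tx_time if start == 0 else NOW
        blocks.append((timestamp, flags, chunk))
    return blocks
```

A packet that fits in a single block carries both flags, so the hardware can treat the one-fragment case the same way as the general case.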

A common solution to achieve precise transmission 
spacing from the host is to leave the transmitter enabled 
at all times and space transmissions with zero-valued samples.
This solution is inefficient since it wastes both host
CPU cycles and bus bandwidth, and it does not eliminate 
jitter on the receive side. 

Receive: To reduce the receiver jitter ($\delta_r$), the radio
hardware timestamps all incoming sample blocks with
the radio clock time at which the first sample in the block
was generated by the ADC. Given that the sampling rate
is set by the host, the host knows the exact spacing between
samples. It can therefore calculate the exact time
at which any sample was received, eliminating $\delta_r$ and
allowing for full synchronization between transmitter and
receiver.
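The host-side calculation of a sample's reception time follows directly from the block timestamp and the sampling rate. In this sketch the function name and the clock rate used in the example are assumptions; exact rational arithmetic is used so that no rounding error is introduced.

```python
from fractions import Fraction


def sample_time_ticks(block_timestamp: int, sample_index: int,
                      clock_hz: int, sample_rate_hz: int) -> Fraction:
    """Radio-clock time (in ticks) at which a given sample was generated.

    block_timestamp is the clock value recorded for the first sample of
    the block; sample_index counts samples from the start of that block.
    """
    ticks_per_sample = Fraction(clock_hz, sample_rate_hz)
    return block_timestamp + sample_index * ticks_per_sample
```

For example, assuming a 64 MHz radio clock and 8 megasamples per second, consecutive samples are 8 ticks (125 ns) apart, so the fourth sample of a block stamped 1000 was generated at tick 1024.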


5.1.2 Precision Scheduling Evaluation 


To evaluate precision scheduling, we compare the 
timestamp-based release of packets using the split- 
functionality approach with a timer-based implementa- 
tion in GNU Radio and in the kernel. We enable the real- 
time scheduling mechanism, which sets the GNU Radio 
processes to the highest priority. Our experiment trans- 
mits a frame used as a logical time reference, and then at- 
tempts to transmit another frame at a controlled spacing 
over the air. With no error, the actual spacing over the air 
is equal to the targeted spacing. We measure the actual 
spacings achieved using a monitoring node (Figure 3). A 
USRP on the monitoring node measures the magnitude 
of received complex samples at 8 megasamples per sec- 
ond, resulting in a precision of 125 nanoseconds. With 
no transmission jitter ($\delta_t$), the spacing between beacons
will exactly match their transmission rate, while any vari- 
ability in scheduling will affect the spacings. The nodes 
are connected via coaxial cable to avoid the impact of 
external signals. 

Figure 4: Split-functionality vs. host scheduling.

We compare the measured spacing of 50 transmissions
with target spacings from 100 ms to 1 µs. Figure 4
shows the host and kernel based implementations
to have approximately 1 ms and 35 µs of error, respectively.
The timestamp-based mechanism achieves exact
spacing to our monitoring node’s precision. Therefore, 
moving timestamps to the kernel improves accuracy, but 
the error is still at least an order of magnitude greater 
than in the split-functionality design. Section 6.1 quan- 
tifies the benefits further through the implementation of 
a Bluetooth-like TDMA protocol. In the evaluation, we 
also measure the jitter with the split-functionality approach to
be within 312ns. The average results show one-sided er- 
ror, illustrating that compression of data across the bus 
dominates over dispersion. This is likely due to the mul- 
tiple stages of buffers, including the buffers on the radio 
hardware to read the data from the FX2 controller. While 
dispersion is recorded, it occurs infrequently. 


5.2 Carrier Sense 


The performance of carrier sense is crucial to CSMA 
protocols: the longer it takes to transmit a packet after 
the channel goes idle, the greater the chance of colli- 
sion. This turnaround time is referred to as the carrier 
sense “blind spot” by Schmid et al. [14]. This blind spot
has 4 components: signal propagation delay, the delay 
between the radio hardware and host for incoming sam- 
ples, the processing delay involved in carrier detection at 
the host, and the complete transmission delay once the 
medium is detected idle at the host; this includes mod- 
ulation of a packet and transferring the samples to the 
radio hardware for transmission. 


5.2.1 Carrier Sense Design 


To significantly reduce the size of the carrier sense blind 
spot, we must avoid the associated delays by placing the 
decision at the radio hardware. However, the decision 
process should be controlled by software running on the 
host CPU to maintain flexibility. The first assumption we 
can make is that if carrier sense is to be performed, the 
host has data to transmit and can modulate it and pass 
it to the radio hardware to pend on carrier sense. The
per block meta-data for the transmission has a single bit 
flag set to indicate the block should be held until there 
is no carrier using a locally computed RSSI value. The 
host can control the carrier sense threshold via the con- 
trol channel. We use an RSSI value recorded in the radio 
hardware to implement a simple RSSI threshold carrier 
sense mechanism. 
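The hold-until-idle behavior can be modeled as a small gate in front of the transmitter. This is an illustrative software model of logic that would reside on the radio hardware; the class and method names and the use of dBm units are assumptions.

```python
from collections import deque


class CarrierSenseGate:
    """Holds flagged blocks until the measured RSSI indicates an idle channel."""

    def __init__(self, threshold_dbm: float):
        self.threshold_dbm = threshold_dbm
        self.pending = deque()

    def set_threshold(self, threshold_dbm: float) -> None:
        """Models the host changing the threshold via the control channel."""
        self.threshold_dbm = threshold_dbm

    def submit(self, block, wait_for_idle: bool) -> None:
        """Queue a block; wait_for_idle mirrors the single-bit meta-data flag."""
        self.pending.append((block, wait_for_idle))

    def on_rssi(self, rssi_dbm: float) -> list:
        """Called for each new RSSI reading; returns the blocks to transmit."""
        released = []
        while self.pending:
            block, wait_for_idle = self.pending[0]
            if wait_for_idle and rssi_dbm >= self.threshold_dbm:
                break  # carrier present: keep holding, preserving order
            released.append(block)
            self.pending.popleft()
        return released
```

The host keeps control of the policy (which blocks wait, and at what threshold), while the time-critical comparison happens next to the RSSI measurement.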


5.2.2 Carrier Sense Evaluation 


We now present an evaluation of the carrier sense com- 
ponent in comparison to performing carrier sense at the 
host. In the host implementation, the received signal 
strength is estimated from the incoming sample stream 
and uses thresholds to control outgoing transmissions. 
We use the evaluation setup in Figure 3, described in 
Section 5.1.2, to achieve a 125 nanosecond resolution 
in measuring the achieved carrier sense blind spot. The
two contending nodes exchange the channel using carrier
sense 100 times and we measure the spacing between
each transmission, as illustrated in Figure 5. The
first contending node, $C_1$, finishes its transmission, and
$C_2$ takes some time to detect the channel as idle and begin
its own transmission. This interval represents the carrier sense
turnaround time, or blind spot.

We plot two example channel exchanges using both 
implementations in Figure 6. Time is relative in the fig- 
ure and we align the contending node’s end of transmis- 
sion at time 100. We highlight the gap in both implemen- 
tations, and present the average gap observed across 100 
exchanges: 1.5 µs and 1.98 ms for the split-functionality
and host implementations, respectively. The host based
latency could be reduced closer to 1 ms, or on the order
of tens of microseconds, by splitting the functionality to
the USRP device driver, or the kernel, respectively. In
our evaluation, the times were recorded at a higher-level 
block in GNU Radio where a MAC protocol would re- 
side. These measurements illustrate our design’s abil- 
ity to reduce the carrier sense blind spot by three orders 
of magnitude, while maintaining host control on a per- 
packet basis. This can significantly increase the capac- 
ity in the channel by reducing the time it takes to detect 
it is idle. The host can even control the threshold on a 
per-packet basis by placing a control packet with a new 
threshold on the bus before the data packet. 
Figure 6: Measured carrier sense blind spots.


5.3 Backoff


In contention based protocols, backoff is used to reduce 
collisions and increase fairness. Although the technique 
varies by protocol, a common implementation is to re- 
duce collisions by forcing a transmission delay and to 
increase fairness by making this delay random. The 
various delay components in SDRs prevent fine-grained 
backoff at the host. As shown in Section 5.1, a host
backoff of less than 1 ms is unachievable and values between
1 ms and 100 ms would be unpredictable. Therefore,
backoff at the host would require a large minimum
backoff time, which decreases channel capacity.

Despite our timestamping mechanism achieving mi- 
crosecond level accuracy (Section 5.1.2), such a mecha- 
nism alone is insufficient. If a new backoff time is to be 
computed once a failure is reported to the MAC on the 
host, the retransmission would incur at least a radio-to- 
host RTT after the previous transmission, meaning the 
minimum backoff in a host implementation is an RTT. 
The average RTT measured in Section 2.1 was 612 µs
with a standard deviation of 789 µs and a maximum
observed value of 9 ms. This is insufficient by current
protocol standards. Placing the backoff algorithm on the 
radio hardware would require developers to make low 
level changes. We therefore explore a split-functionality 
approach for backoff. 


5.3.1 Backoff Design 


To enable flexible fine-grained backoff we build upon 
the precision scheduling mechanism (Section 5.1) to in- 
troduce a technique that leaves the backoff algorithm 
and computations at the host, and the actual transmission
delay on the radio hardware. The key observation
that enables our technique is that all backoff times, from
the initial transmission $n_0$ to $n_{MAX\_RETRIES}$, can be
pre-calculated by the host. The host calculates the backoff
time for transmission $n_0$, and then, assuming failure,
calculates all remaining backoffs from 1 to MAX_RETRIES,
including each in the per packet meta-data.

A flag is set in the per block meta-data for the radio
hardware to interpret the timestamp value as the maximum
number of retries ($M$), and the first $M$ 32-bit words
prepended to the data payload as backoff
times for each retransmission. Each value is interpreted
as a time-to-wait, where the transmission is scheduled
at current_clock+backoff. Moreover, we implement
a control channel command that allows the host to con- 
figure the interpretation of a backoff value as an absolute 
time-to-wait, or a channel idle time-to-wait (most com- 
mon). 
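The host-side precalculation can be sketched as follows. The binary exponential contention window is borrowed from 802.11-style backoff purely as an example policy, and the little-endian word order is an assumption; the paper leaves the backoff algorithm to the protocol developer.

```python
import random
import struct


def precompute_backoffs(max_retries: int, slot_ticks: int,
                        cw_min: int, cw_max: int, rng: random.Random):
    """Draw one backoff (in clock ticks) per possible retransmission.

    Assumes a doubling contention window as an example policy.
    """
    backoffs = []
    cw = cw_min
    for _ in range(max_retries):
        backoffs.append(rng.randint(0, cw) * slot_ticks)
        cw = min(2 * cw + 1, cw_max)  # assume the previous attempt failed
    return backoffs


def build_backoff_payload(backoffs, samples: bytes):
    """Return (timestamp field, payload) for a block with the backoff flag set.

    The timestamp field carries M, the number of retries, and the payload
    starts with M 32-bit backoff words followed by the modulated samples.
    """
    words = struct.pack("<%dI" % len(backoffs), *backoffs)
    return len(backoffs), words + samples
```

Because every draw is made up front, the hardware only needs a counter and a comparison against its clock, while the choice of distribution and window growth stays in host software.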

This technique does not affect scheduling of future 
transmissions, as for example in 802.11 the contention 
window is reset to the minimum on a successful trans- 
mission. This means that the host can fully schedule a 
transmission and before a success/failure notification is
given by the hardware, it can prepare the next transmis- 
sion and buffer it on the radio hardware. 


5.3.2 Backoff Evaluation


Given that the backoff technique uses the precision 
scheduling mechanism, its accuracy is the same as the 
precision scheduling mechanism and on the order of mi- 
croseconds. We also use the backoff technique in our 
split-functionality 802.11-like protocol evaluation found 
in Section 6. 


5.4 Fast Packet Recognition 


Traditional software-defined radios, in the receive state, 
stream captured samples at some decimated rate between 
the radio hardware and the host. For many MAC pro- 
tocols, such as CSMA-style designs, the radio cannot 
determine when packets for the attached node will ar- 
rive. As a result, the radio must remain in the receiving 
state. The downside to this is that the demodulation pro- 
cess uses significant memory and processor resources de- 
spite the fact that incoming packets destined for the radio 
are infrequent. As such radios become more ubiquitous 
and common for implementation, resource usage will 
become increasingly important, especially for energy- 
constrained devices such as the battery-powered Kansas 
University Agile Radio [9]. 

One simple solution would be to send samples when 
the RSSI is above some threshold. However, this does 
not filter out transmissions destined to other hosts and 
external signals. A better solution would be to have 
the radio hardware look for the packet preamble and 
the destination address, then transfer a maximum packet 
size worth of samples to the host after any match. At 
first glance, it may seem that fast packet recognition 
Figure 7: Matched filter & dependent packet design.


is not a “necessary” function for implementing MAC 
protocols, especially since the CPU and bus bandwidth 
resource consumption can become insignificant rather 
quickly (i.e., due to Moore’s Law). However, trends in
bus delay do not have this same property. As we will dis- 
cuss further in Section 5.5, the ability to identify packets 
and process them partially on the SDR hardware is crit- 
ical to supporting low-latency MAC interactions (e.g., 
packet/ACK exchanges or RTS/CTS) in a high-latency 
architecture. 


5.4.1 Fast Packet Recognition Design 


Our goal is to accurately detect packets at the radio hard- 
ware without demodulating the signal (to keep flexibil- 
ity), for which we perform signal detection. The most 
relevant work in signal detection comes from the area of 
radar and sonar system design. From this area, we bor- 
row a well-known technique, called a matched filter, to 
detect incoming packets at the radio hardware without 
the demodulation stage. For the purpose of design dis- 
cussion, we refer to the bottom half of Figure 7. 

Matched filter: A matched filter is the optimal lin- 
ear filter that maximizes the output signal to noise ratio 
for use in correlating a known signal to the unknown re- 
ceived signal. For use in packet detection, the known 
signal would be the time-reversed complex conjugate of 
the modulated framing bits. This known signal is stored 
as the coefficients of the matched filter (Figure 7). The 
received sample stream is convolved with the coefficients 
to perform cross-correlation, where the output can be 
treated as a correlation score between the unknown and 
known signals. The correlation score is then compared 
with a threshold to trigger the transfer of samples to the 
host. The matched filter is flexible to different modula- 
tion schemes (e.g., GMSK, PSK, QAM), but requires a 
Fast Fourier transform for OFDM, given that the sym- 
bols are in the frequency domain. This would require an 
FFT implementation on the radio hardware. 
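The matched filter operation described above can be illustrated in software. This is a minimal, unoptimized sketch using plain Python complex numbers; the function names are ours, and a hardware implementation would use fixed-point arithmetic in the FPGA instead.

```python
def matched_filter_coeffs(known):
    """Coefficients: the time-reversed complex conjugate of the known signal."""
    return [s.conjugate() for s in reversed(known)]


def correlation_scores(samples, coeffs):
    """Convolve the sample stream with the coefficients.

    Only fully overlapping positions are computed; the score at index i is
    the magnitude of the cross-correlation with the known signal at offset i.
    """
    n = len(coeffs)
    scores = []
    for i in range(len(samples) - n + 1):
        acc = 0j
        for k in range(n):
            acc += coeffs[k] * samples[i + n - 1 - k]
        scores.append(abs(acc))
    return scores


def detect(samples, known, threshold):
    """Return the first offset whose score reaches the threshold, else None."""
    scores = correlation_scores(samples, matched_filter_coeffs(known))
    for i, score in enumerate(scores):
        if score >= threshold:
            return i
    return None
```

When the stream contains the known sequence exactly, the score at that offset equals the energy of the sequence, which is why a threshold somewhat below that energy separates matches from partial overlaps.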

To also detect that the frame is destined to the par- 
ticular host, two different methods that have mathemat- 
ically different properties can be used. Single Stage: 
Use a frame format where the destination address is the 
first field after the framing bits, and use this complete
modulated sequence as the matched filter coefficients. 
Dual Stages: detect the framing bits first, then change the 
coefficients to the modulated destination address. Our 
implementation uses the single stage approach for sim- 
plification. However, a dual stage is more appropriate 
for monitoring multiple addresses such as a local address 
and a broadcast address. 


5.4.2 Fast Packet Recognition Evaluation 


We evaluate the effectiveness of the matched filter at de- 
tecting incoming sequences using simulations where we 
can control the noise level. Results are presented from 
over the air experiments with the presence of interfer- 
ence, multipath, and fading in Section 5.5. 

To evaluate the effectiveness of the matched filter with 
varying signal quality, we first run experiments with 
controlled signal-to-noise ratios (SNR) using the GNU 
Radio software. We introduce additive white Gaussian 
noise (AWGN) to control the SNR in terms of dB: 


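$$\mathrm{SNR(dB)} = 10 \cdot \log_{10} \frac{\mathrm{Power}_{signal}}{\mathrm{Power}_{noise}} \qquad (1)$$

To introduce noise, we compute the noise power based
on the specified snr and the power in the signal:

$$\mathrm{SNR} = 10^{(snr/10)}$$

$$\mathrm{Power}_{signal} = |\mathrm{Signal}_{ampl}|^2$$

$$\mathrm{Power}_{noise} = \frac{\mathrm{Power}_{signal}}{\mathrm{SNR}}$$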
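The noise-power computation can be sketched in a few lines. This is an illustration, not the GNU Radio code used in the evaluation: it assumes the signal power is taken as the mean squared magnitude of the complex samples, and that the noise power is split evenly between the in-phase and quadrature components.

```python
import math
import random


def noise_power_for_snr(signal, snr_db: float) -> float:
    """Noise power needed to reach the requested SNR (in dB) for this signal."""
    snr_linear = 10 ** (snr_db / 10)
    power_signal = sum(abs(s) ** 2 for s in signal) / len(signal)
    return power_signal / snr_linear


def add_awgn(signal, snr_db: float, rng: random.Random):
    """Return the signal with complex white Gaussian noise added."""
    sigma = math.sqrt(noise_power_for_snr(signal, snr_db) / 2)
    return [s + complex(rng.gauss(0, sigma), rng.gauss(0, sigma))
            for s in signal]
```

At 0 dB the noise power equals the signal power, and each further 10 dB reduction in SNR multiplies the noise power by ten, which is the regime in which the evaluation compares the matched filter with the full decoder.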


For evaluation, 1000 frames of 1500 bytes are encoded 
using the Gaussian minimum-shift keying (GMSK) mod- 
ulation scheme. These frames are used as the ground 
truth and mixed with the noise. We require that the 
matched filter detect the framing bits and that the trans- 
mission is destined for the attached host using the single- 
stage scheme (Section 5.4.1). The success rate is defined 
as the number of detected frames over the total number 
of frames in the dataset (1000). For comparison, we also 
include the success rate of the full GMSK decoder. At 
a high noise level, even the full decoder will fail at de- 
tecting the frames. The success rate, as a function of 
the SNR, is shown in Figure 8. The results show that 
the matched filter can detect the frames at a much higher 
success rate than the decoder can, even at low SNR levels 
where the noise power is greater than the signal power. 

Given these results, and further real-world results 
presented in Section 5.5, we conclude that using the 
matched filter for detecting relevant packets is accurate 
enough that the host will never miss an actual frame due 
to the filter. In fact, the filter triggering samples to the
host can be seen from a different perspective as providing
further confidence to the host that there is actually
Figure 8: Success rate of the matched filter.


a frame within the sample stream. The host could then 
perform additional processing in an attempt to decode 
the frame successfully. 


5.5 Dependent Packets 


Dependent packets are packets generated in response to 
another packet (e.g., an ACK or RTS packet). MAC pro- 
tocols often leave the channel idle during the dependent 
packet exchanges such as RTS-CTS and data-ACK ex- 
changes. As a result, reducing the turnaround time of 
such exchanges can significantly increase overall capac- 
ity. In a host-based MAC, three sources contribute to the 
delay associated with dependent packet generation: bus 
transmission delay, queuing delay, and processing time. 
In this section, we explore the use of a matched filter 
along with additional techniques for triggering depen- 
dent packet responses on the radio hardware. The tech- 
nique minimizes processing time by placing the packet 
detection as close to the radio as possible and avoids 
bus transmission and queuing delays by triggering a pre- 
modulated packet stored on the radio hardware. 


5.5.1 Decoding Delay at the Host 


We begin by quantifying the processing delay associated 
with host-based dependent packet generation. Note that 
we have already quantified bus delays in Section 2.1. We 
measure decode time for various frame sizes at the maximum
supported decoding rate of the USRP: 2 Mbps. The
larger frame sizes would be representative of process- 
ing time for data/ACK exchanges, and the smaller frame 
sizes for RTS/CTS exchanges. 

We use two 3.0GHz Pentium 4 machines running 
GNU Radio with their USRPs transmitting/receiving us- 
ing the GMSK modulation scheme. Using host based 
timers, we record the minimum, average, and maxi- 
mum time to decode 6 different frame sizes seen in Fig- 
ure 9. The average decoding time is close to the mini- 
Figure 9: Decode times for various frame sizes.


mum recorded times for each frame size, however, rather 
large delays can be experienced at each frame size, likely 
due to the jitter introduced by queuing delays and pro- 
cess scheduling. Therefore, if one were to implement 
the matched filter at the radio hardware to detect in- 
coming dependent packets and generate responses, any- 
where from several milliseconds to 70 milliseconds can 
be saved solely in host processing. 


5.5.2 Generating Fast-Dependent Packets 


As an optimization to circumvent the decoding delays 
described, we develop a mechanism for fast-dependent 
packet generation in the radio hardware. This is not nec- 
essarily limited to host-PHY architectures. Although bus 
delay is reduced in NIC-PHY architectures, they typically
use slower processors that increase decoding delays.
Fast-dependent packet generation has three stages:
(1) fast-packet detection of the initiating packet (e.g., 
RTS), (2) conditionals specific to the protocol that trig- 
ger the dependent packet, and (3) transmission of a pre- 
modulated dependent packet. We discuss stages 2 and 
3 in this section. Stage 1 was detailed in Section 5.4, 
although it is important to point out that by running mul- 
tiple matched filters in parallel, it is possible to detect and 
respond to different initiating packets. 

Stage 2: To introduce protocol dependent behavior 
after stage 1 detects the initiating packet and its end
of transmission (the incoming signal drops to the noise 
floor), protocol developers can introduce a set of condi- 
tionals that control when a dependent packet is gener- 
ated. In our current implementation this must be written 
in a hardware description language (Verilog), which has 
primitives similar to those in C/C++ (e.g., if, else, case, 
etc.). A simple example is the conditional for generating 
a CTS in Verilog. It checks that the receiver and channel 
are idle: if(!receiving && RSSI < carrier_sense_thresh).

A more interesting example is the fast-ACK genera- 
tor developed for our 802.11-like protocol (Section 6.3). 


We write 3 simple conditional statements around an SNR 
value. If any of the conditionals pass during the transmis- 
sion, the radio hardware concludes that the host would 
not have been able to decode the packet, and a fast-ACK 
should not be triggered. The following are the 3 condi- 
tionals, with reasons as to why the fast-ACK should not 
be generated based on the conditional passing. (1) if(SNR 
< lowest_thresh): interference throughout the transmis- 
sion. (2) if(last.SNR_val - SNR < drop_thresh): interfer- 
ence at the tail of the transmission, or fading. (3) if(SNR - 
last.SNR_val > increase_thresh): interference at the head 
of the transmission, or multipath. The technique is illus- 
trated in the overall system in Figure 7, where the cor- 
relation threshold for a data packet raises a signal which 
streams the samples to the SNR monitor. The final con- 
ditional is to detect the carrier as idle; then the fast-ACK 
is generated. 

Stage 3: To satisfy fast-dependent packet generation, 
the dependent packet must be pre-modulated and stored 
on the radio hardware, for which we provide a mech- 
anism on the control channel. Pre-modulation restricts 
the dependent packet to not contain fields dependent on 
the initiating packet (e.g., a MAC address). However, it 
still permits many dependent packets like those in cur- 
rent protocol standards (e.g., ACKs, RTS/CTS). For ex- 
ample, despite 802.11’s requirement for a destination ad- 
dress in an ACK packet, we can still develop and evaluate 
an 802.11-like protocol where senders assume the desti- 
nation of the ACK based on data transmissions. We re- 
mind the reader that a goal of our work is to enable MAC 
implementations and building blocks for novel MAC de- 
signs, not to necessarily support every current protocol 
to its specification. Future work could be in the de- 
velopment of a technique which extracts part of an in- 
coming signal (e.g., destination address) and then per- 
forms additional processing to use this raw signal in a 
pre-modulated dependent packet. This would essentially 
enable dynamic fast-dependent packets, without the in- 
teraction of the host. We do not explore this in the scope 
of our work. 

Fast-Dependent Packet Evaluation: To illustrate the 
fast-dependent packet generator, we evaluate an imple- 
mentation of the fast-ACK generator outlined in the de- 
scription of stage 2. First, we use the control channel to 
setup a matched filter which detects the framing bits and 
the attached node’s address (satisfying stage 1). Then, 
we pre-modulate an ACK that uses the broadcast address 
as the destination address for all active nodes to parse it 
(satisfying stage 3). 

To evaluate the SNR monitoring technique, and further
evaluate the matched filter’s ability to detect packets
in a real-world scenario, we use a setup of two USRP nodes
in the ISM band in the presence of 802.11 and Bluetooth
devices, incorporating real-world interference in our results.
We detected 6 active 802.11 devices within interference
range, but ensured that none were within 40 feet
of either node. To test in adversarial conditions with multipath
interference, the two USRPs were placed in separate
rooms with no direct line of sight. The matched filter
and fast-ACK technique are enabled at the receiver,
to which we transmit 10000 frames at 1 Mbps. These
frames are considered the ground truth for the matched
filter, whose accuracy in detecting the frames we are trying
to determine. Full decoding of the data packets at
the host is used as the ground truth for the fast-ACK gen- 
erator. If the full decoder successfully decodes the frame, 
and the SNR monitor triggers a fast-ACK, it is consid- 
ered success. If the SNR monitor chose to not generate 
a fast-ACK in this scenario, it is considered failure. An 
additional failure scenario is triggering a fast-ACK when 
the host could not decode the frame. 

For the 10000 frames transmitted, we find that the 
matched filter is able to detect the transmissions with 
100% success rate, reinforcing the simulation results 
from Section 5.4.2 with real world signal propagation 
properties. Of the 10000 frames, 460 transmissions were 
not decodable. Using the SNR monitoring technique we 
detect 457 of the corrupted frames for a failure rate of 
0.6%. Inspection of the 3 misses could not determine 
the cause of transmission failure. The error rate of not 
generating an ACK, when one should have been, is 4%. 

There are implications to incorrectly generating 
ACKs, which the MAC can be designed to recover from, 
or higher layers such as TCP can be relied on. Our eval- 
uation further explores the matched filter’s accuracy and 
illustrates the ability to implement fast-dependent pack- 
ets. Reducing the error rates seen by our technique is 
future work, either by improving the SNR monitoring 
technique, or introducing other fast-ACK techniques. An
example for improvement would be detecting multipath 
during SNR monitoring, which is a property that can re- 
duce decoding probability. 


5.6 Access to Physical Layer Information 
and Fine-grained Radio Control 


The underlying radio hardware in an SDR platform has 
many controls that are not configured by the transmitted 
sample stream (e.g., transmission frequency and power), 
and can make many observations that are not easily de- 
rived from the input sample stream (e.g., RSSI). We use 
our control channel between the SDR hardware and host 
to expose these controls and physical layer information 
to the MAC protocol implementation. Many existing network
interfaces use similar designs for setting the transmission
channel and obtaining RSSI measurements. One
key difference is that our interface operates on blocks of 
samples instead of packets. 
Physical Layer Information: Access to physical layer
information at all other layers in the processing chain is 
important for supporting common cross-layer optimiza- 
tions. This can be seen through recent work where per-bit 
confidence levels are used to perform partial packet re- 
covery [7]. In our design, information from the SDR can 
be sent to the host using either the control channel or per 
block meta-data. We use this mechanism to report RSSI 
to the host. Note that the host could calculate RSSI us- 
ing the raw samples, but an RSSI value which takes into 
account the gain or attenuation in the RF stages is only 
available at the radio hardware. The control protocol is
easily modified to support reporting additional properties;
however, developers must reprogram the FPGA to
report the desired values.


Radio Control: We implement a set of radio hardware 
control messages on the control channel (Section 4.2) 
that can be synchronized with packet transmissions using
the timestamp. For example, by placing a control
block with a timestamp T on the bus before a data packet
that uses a NOW timestamp, the radio will be reconfigured
at time T and the data packet will be transmitted
immediately after the reconfiguration. This can
be used to implement common techniques such as rapid 
frequency hopping. Unfortunately, on the USRP, the
daughterboards are tuned directly from the FX2 USB
controller using the I2C bus, which has no connection
to the FPGA. Therefore, we cannot issue daughterboard 
commands from the FPGA using the control channel and 
hardware clock to implement rapid frequency hopping. 
The USRP2 tunes the daughterboards directly from the 
FPGA. Therefore, if our design were implemented on the
USRP2 (unavailable at the time), rapid frequency hopping
could be achieved.


6 MAC Evaluation 


We now provide end-to-end results for a Bluetooth-like
TDMA protocol and an 802.11-like CSMA protocol. The
protocols use the split-functionality design described in 
Section 5 and we compare their performance with that of 
full host-based implementations. 


6.1 Bluetooth-like TDMA Protocol 


To illustrate the effectiveness of the overall system 
design, we implement a tightly timed Bluetooth-like 
TDMA protocol. Like Bluetooth, the network (piconet) 
consists of a master and a maximum of 7 slaves. The 
slaves communicate with the master in a round-robin 
fashion within a slot time of 625 µs. Unlike Bluetooth,
our protocol uses a fixed frequency instead of hopping (a
limitation of the USRP discussed in Section 5.6), differs
slightly in synchronization (it bypasses pairing), and varies
the slot guard time for evaluation.

Each slave in the network synchronizes with the start
of a round by listening for the master’s beacon, and calculates
the start of transmission (Section 5.1) as the logical
synchronization time $T$. The beacon frame also
carries the total number of registered slaves ($N$) and
the guard time ($T_g$). The slave can then compute the
total round time, which must account for the master:
$T_r = (N+1)(T_s + T_g)$, where $T_s$ is the slot time (625 µs).
The start of round $k$ is computed as $R_k = T + T_r \cdot k$. We
remind the reader that this is a logical time kept at each
node, taken from the beacon frame, which is a global reference
point. Global hardware clock synchronization is
explored in Section 6.2. Finally, each slave’s slot offset is
computed from its node ID ($n$) as $O_n = n (T_s + T_g)$, which
is then used to compute the local start time of slave $n$’s
slot in round $k$: $T_n(k) = R_k + O_n$.
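
The slot timing arithmetic described above can be sketched in a few lines of Python. This is only an illustration of the formulas, not the implementation used on the platform: the `Beacon` class, the function names, and the use of microseconds as the time unit are assumptions made here, and the text does not say whether node IDs start at 0 or 1.

```python
from dataclasses import dataclass

SLOT_TIME_US = 625.0  # slot time T_s given in the text


@dataclass(frozen=True)
class Beacon:
    """Hypothetical container for the values a slave learns from a beacon."""
    sync_time_us: float   # logical synchronization time T
    num_slaves: int       # number of registered slaves N
    guard_time_us: float  # guard time T_g


def round_time(b: Beacon) -> float:
    """T_r = (N + 1) * (T_s + T_g); the +1 accounts for the master."""
    return (b.num_slaves + 1) * (SLOT_TIME_US + b.guard_time_us)


def round_start(b: Beacon, k: int) -> float:
    """R_k = T + T_r * k."""
    return b.sync_time_us + round_time(b) * k


def slot_offset(b: Beacon, n: int) -> float:
    """O_n = n * (T_s + T_g) for the slave with node ID n."""
    return n * (SLOT_TIME_US + b.guard_time_us)


def slot_start(b: Beacon, n: int, k: int) -> float:
    """T_n(k) = R_k + O_n: local start time of slave n's slot in round k."""
    return round_start(b, k) + slot_offset(b, n)
```

For example, with 4 registered slaves and a 1 µs guard time, `round_time` gives 5 × 626 = 3130 µs, and slave 2's slot in round 3 starts 3 × 3130 + 2 × 626 = 10642 µs after the synchronization time.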


6.1.1 TDMA Results 


We use two metrics in our evaluation: ability to main- 
tain tight synchronization and overall throughput. The 
synchronization error at the master is 15ns, computed by 
measuring the actual spacing of 1000 beacons using a 
monitoring node (discussed in Section 5.1.2). This il- 
lustrates the tight timing of the master’s beacon trans- 
missions. To measure the synchronization error at the 
slaves, we record the calculated timestamps of 1000 beacons
at 4 slaves. Each timestamp should be exactly $T_r$
apart from the next. The absolute error in spacing rep- 
resents shifts in the slave’s calculation of the start of the 
round. We find the maximum error of the 1000 beacons 
at all 4 slaves to be 312 nanoseconds, with an average 
of 140ns. This answers the question of our platform’s 
ability to obtain tight synchronization at both transmit- 
ters (master) and receivers (slaves). 

We compare a split-functionality implementation to a 
host implementation, which differ in their guard times. 
A guard time of 1 µs is used for the split-functionality
implementation, which is roughly 3 times the maximum
error. We use our round-trip host and radio hardware
delay measurements from Section 2.1, which account
for both transmission and reception timing variability,
to estimate the host guard time needed. A guard time of
9ms would be needed to account for the maximum error;
however, this delay occurs rarely, and we therefore
present results using a generous guard time of 3ms (approximately
3 × sdev) and a more realistic guard time of
6ms based on our recorded delay distribution.

We perform 1000KB file transfers, varying the number
of registered slaves and presenting averaged results
across 100 transfers in Figure 10. The split-functionality
implementation is able to achieve an average of 4 times
the throughput of the host-based implementation. While
the measurements above only answer the question of obtaining
synchronization, we find that throughout the full
transfers no slave drifts into another slot period using
only the initial beacon for synchronization, illustrating
the ability to maintain tight synchronization. These results
are promising for the development of TDMA protocols
on the platform.


Figure 10: TDMA throughput comparison results. Average throughput (Kbps) versus number of registered slaves (N) for a 1 µs guard time (timestamp) and for 3ms and 6ms guard times (host).


6.2 Additional TDMA Protocols 


Another common TDMA implementation is the use of 
global clock synchronization. We extend the Bluetooth- 
like protocol to use global clock synchronization on the 
platform rather than the logical clock. The implementa- 
tion design is as follows. The global clock in the network 
is the clock of the master, to which all slaves synchronize 
via beacon frames. In addition to the information sent 
in each beacon frame described in Section 6.1, the master
includes the timestamp at which the beacon is locally
scheduled for transmission. 

For global synchronization, the slave takes its estimated
local time of the master’s beacon transmission
and subtracts the incoming global clock timestamp included
in the beacon to calculate δ, the local clock offset
from the master. The error is within 312ns plus over-the-air
propagation delay. The MAC framework can now
synchronize to the global clock with a command packet
(Section 4.2) which adds δ to the local clock. Another
option is to use a timestamp transformation where the
MAC adds δ to all timestamps. Using this methodology,
we are able to achieve measurement results similar
to those in Figure 10 using global synchronization.


6.3 802.11-like CSMA Protocol


We implemented two 802.11-like CSMA MAC protocols,
one fully on the host CPU and one using our


              pairs   Avg(Kbps)   min   max
platform      1       408         387   415
host          1       215         190   240
platform      2       205         201   210
host          2       112         101   130


Table 2: 802.11-like CSMA protocol per-pair results.


split-functionality optimizations including on-board car- 
rier sense (Section 5.2), dependent packet ACK genera- 
tion (Section 5.5), and backoff (Section 5.3). The MAC 
implements 802.11’s clear channel assessment (CCA), 
exponential backoff, and ACK’ing. Our protocol does 
not implement SIFS and DIFS periods; this work is in 
progress. For space reasons, we focus our description on 
how the 802.11-like protocol uses our architecture. 


The host-based implementation places all functional- 
ity on the host CPU, including carrier sense, ACK gener- 
ation, and the backoff. The optimized implementation 
uses the matched filter and SNR monitoring for ACK 
generation, and performs carrier sense and backoff on 
the radio hardware. We configure the USRPs for a target
rate of 0.5Mbps, and run 100 1MB file transfers for each
implementation using a center frequency of 2.485GHz in
an attempt to avoid 802.11 interference. This allows us 
to present results that highlight the differences in the im- 
plementation without the effect of uncontrolled interfer- 
ence. We also vary the number of nodes in the network, 
where each pair of nodes performs a transfer. 


The results for the two implementations are shown 
in Table 2. We see significant performance increases 
from the use of the split-functionality implementation. 
The split-functionality implementation nearly doubles the
throughput on average, likely due to the time saved in
decoding to generate the ACK and the reduced delays
associated with carrier sense and backoff.
We note that the matched filter detected every framing 
sequence, and the fast-ACK generation technique only 
failed 2 times over the total number of runs. To recover 
from these failures, we implemented a feedback mechanism
on the host that checks the SNR monitoring
technique’s decision and retransmits. This is needed since we
did not use a higher-layer recovery mechanism like TCP.
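
As a quick check of the "nearly doubles" claim, the average throughputs reported in Table 2 can be divided directly. The snippet below is only a worked calculation on those published averages; the dictionary layout and function name are ours.

```python
# Average throughput in Kbps, copied from Table 2 (keyed by number of pairs).
TABLE2_AVG_KBPS = {
    1: {"platform": 408, "host": 215},
    2: {"platform": 205, "host": 112},
}


def speedup(pairs: int) -> float:
    """Ratio of split-functionality (platform) to host-based average throughput."""
    row = TABLE2_AVG_KBPS[pairs]
    return row["platform"] / row["host"]


if __name__ == "__main__":
    for pairs in sorted(TABLE2_AVG_KBPS):
        print(f"{pairs} pair(s): {speedup(pairs):.2f}x")
```

The ratios come out to roughly 1.90 for one pair and 1.83 for two pairs, which is consistent with the statement in the text.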


7 Related Work 


We review related work in the area of MAC development.
Existing platforms mostly use the extremes of the design
space, where either the majority of functionality is fixed
on the network card (traditional NICs) or all
processing is performed at the host (software-defined radios).


7.1 Traditional NICs


Several efforts [13, 4, 16] have built new MAC protocols 
on top of existing commercial NICs (e.g., 802.11 cards). 
Unfortunately, commercial 802.11 cards implement the 
bulk of the MAC functionality in proprietary microcode 
on the card, limiting what functions can be changed by 
researchers. As a result, this approach is not very satisfactory:
the range of MAC protocols that can be implemented
is limited, and performance (e.g., throughput,
capacity) is often poor because the MAC must be implemented
on the host. For example, past efforts have
mostly implemented TDMA-based schemes.


7.2 Software-defined Radios 


Software-defined radios (SDRs) provide a compelling 
architecture for flexible wireless protocol development 
since most aspects of both the MAC and physical layer 
are, by design, implemented in software and thus, in principle,
easy to modify. However, so far, SDR efforts
have focused on implementing the physical layer [19] 
while MAC and higher layer protocol development has 
received little attention. 

Recent work by Schmid et al. [14] examines the impact
of increased latency in software-defined radios using
GNU Radio and the USRP. The authors address how
the bus latency creates “blind spots” that increase collision
rates when carrier sense is performed at the host, and
how pre-computation of packets is not possible without
fully demodulating (at the host), resulting in larger inter-frame
spacing. Our design provides solutions for both of
these issues in Sections 5.2 and 5.4, respectively. Bus delay
measurements were also taken by Valentin et al. [18].

On top of these hardware challenges, the original
streaming-based design of GNU Radio and the fixed-size
data limitation on its blocks prevent packet processing.
Dhar et al. [3] take the approach of integrating the
Click modular router [12] with GNU Radio. GNU Ra- 
dio blocks are imported into Click to handle the physical 
layer, while Click is used to implement the MAC layer. 
Additionally, the authors interface with the USRP to pro- 
vide a full SDR. Another approach extended the GNU 
Radio architecture with m-blocks [2], blocks that allow 
variable length data passing and include meta-data that 
can be used to represent packets. Our work is comple- 
mentary to the above efforts: while they focus on a MAC 
development environment on the host, we focus on the 
partitioning of MAC layer processing between the host 
and radio hardware. Our architecture and results also do 
not depend on a particular environment on the host. 

A number of groups have developed software radios 
with architectures that differ from the current GNU Radio
and USRP design by including a CPU on the radio
hardware (NC-CPU), either as a separate component
or as a core on the FPGA. Examples include the
Rice University Wireless Open-Access Research Plat- 
form (WARP) [20] and USRP2. These designs are more 
expensive, but they offer additional flexibility for par- 
titioning the MAC. However, there is still a non-trivial 
delay (compared with traditional radios) due to physi- 
cal layer processing and queueing. The NC-CPU is also 
likely to be slower than the host CPU, increasing the processing
delay. Finally, in deployed products based on
this architecture, the NC-CPU is likely to be off-limits to
users, similar to the current situation with commercial
wireless cards. As a result, we expect that our architecture
will be useful on this type of platform as well.


8 Conclusions


In this paper, we presented a set of techniques that sup- 
port the implementation of diverse, high-performance 
MAC protocols on software radios. The work is motivated
by the observation that a single one-size-fits-all
MAC protocol cannot meet the demands of increasingly
diverse deployments and application loads. Software ra- 
dios offer flexibility, but their architecture, specifically 
the delay between the host and the radio frontend, has 
traditionally been a problem for MAC protocols. We introduce
a split-functionality approach, which addresses
this problem, and show that it enables the implementa- 
tion of a set of core MAC functions. An implementation 
for the USRP and GNU Radio, along with the imple- 
mentation of an 802.11-like and Bluetooth-like protocol, 
shows the approach is effective. To the best of our knowledge,
these protocol implementations are the first high-speed, 
bi-directional MAC implementations for the GNU soft- 
ware radio platform. For future work, we plan to im- 
plement a more diverse set of MAC protocols to further 
evaluate our design and implement the architecture on 
different SDR platforms to evaluate its generality. 


This paper describes Antfarm, a content distribution system
based on managed swarms. A managed swarm
couples peer-to-peer data exchange with a coordinator 
that directs bandwidth allocation at each peer. Antfarm 
achieves high throughput by viewing content distribution 
as a global optimization problem, where the goal is to 
minimize download latencies for participants subject to 
bandwidth constraints and swarm dynamics. The sys- 
tem is based on a wire protocol that enables the Antfarm 
coordinator to gather information on swarm dynamics, 
detect misbehaving hosts, and direct the peers’ allot- 
ment of upload bandwidth among multiple swarms. Ant- 
farm’s coordinator grants autonomy and local optimiza- 
tion opportunities to participating nodes while guiding 
the swarms toward an efficient allocation of resources. 
Extensive simulations and a PlanetLab deployment show 
that the system can significantly outperform centralized 
distribution services as well as swarming systems such 
as BitTorrent. 


1 INTRODUCTION 


Content distribution has emerged as a critical applica- 
tion as demand for high fidelity multimedia content has 
soared. Large multimedia files require effective content 
distribution services. Past solutions to the content distri- 
bution problem can be categorized into two approaches, 
namely client-server systems and peer-to-peer swarming 
systems, whose fundamental limitations render them in- 
adequate for many deployment environments. 

In the client-server approach to content distribution, 
the content owner operates a set of servers that pro- 
vide the content to every client without tapping into any 
client-side resources. The presence of a central authority 
simplifies the design of client-server systems, exempli- 
fied by YouTube and Akamai: provisioning the network 
simply requires purchasing sufficient bandwidth for the 
desired quality of service and the targeted number of 
clients; accounting and admission control can be handled
by the servers; clients can be prioritized and bandwidth
can be dedicated to desired transfers at fine granularity. 
The chief drawback to the client-server approach is its 
cost and feasibility: the distributor must bear the entire 
bandwidth cost of distributing the content, and operating 
a high-bandwidth data center for a large client population 
can be prohibitively expensive [11]. 

Peer-to-peer swarms offer an emerging alternative, 
where clients interested in downloading a file provide 
content to other clients interested in the same file. 
Swarming protocols transfer part of the bandwidth cost 
from centralized servers to the participants and their ISPs 
by taking advantage of the additional upload capacity of- 
fered by downloading peers. This redistribution of costs 
reduces the bandwidth burden on the servers, helps im- 
prove download times for clients, and reduces ingress 
bandwidth demand for ISPs. Swarming protocols pro- 
posed to date, including BitTorrent [1], Avalanche [24], 
and Dandelion [52], have been designed to resist techni- 
cal and legal attacks by avoiding management and cen- 
tralization. This design choice has led to protocols that 
lack coordination among peers, rely solely on directly- 
obtained measurements to avoid trusting information re- 
layed by peers, and depend on randomization to thwart 
adversaries. The highly decentralized nature of existing 
unmanaged swarming systems leads to a performance 
penalty for legitimate content distributors. 

To understand why unmanaged swarming architectures
fail to make efficient use of bandwidth in multi-swarm
environments, imagine a content provider with
two movies to distribute to two sets of users using a set of
seeders¹ over which they have full control and at which
both movies are replicated. Depending on the size of the
swarms and the nature of the peers that make up each
swarm, the two swarms may have vastly different internal
dynamics. Seeders and peers with blocks belonging
to multiple swarms face a difficult choice: which swarm
should they reward with their upload bandwidth? Simple
heuristics, such as round-robin, are unlikely to work well
because they do not take swarm dynamics into account.
The default BitTorrent behavior, which awards download
slots to the peers with a proven track record of fast downloads,
works well within a single torrent, but can lead to wholesale
starvation in a multi-torrent setting.

¹In this paper, seeders are trusted servers managed by the coordinator
that distribute data blocks to peers. This is in contrast with BitTorrent
terminology, where seeders are altruistic peers that have finished
downloading a file and provide content without further downloads
themselves.

The fundamental problem is one of global optimiza- 
tion: the seeders should award their bandwidth such that 
download times across all swarms are minimized. Cur- 
rent swarming protocols lack the mechanisms to com- 
pute and operate at this point. Consequently, adminis- 
trators that run torrent sites manually prune old torrents 
and reallocate bandwidth to more popular downloads by 
hand. This approach is not guaranteed to achieve a good 
allocation of bandwidth, leads to the “heavy-tail” prob- 
lem where old, unpopular torrents are difficult to find, 
and does not scale. 

This paper describes an efficient content distribution 
system, called Antfarm, based on managed swarms. The 
goal of Antfarm is to distribute a large set of files to a 
potentially very large set of clients. Managed swarms 
introduce a hybrid approach to swarming systems in that 
they permit a coordinator, typically managed by the con- 
tent distributor, to control the behavior of the swarms. 

Antfarm is designed to maximize the system-wide 
benefit of the critical resource, seeder bandwidth. Each 
Antfarm peer provides resources to other participants, re- 
ceives unforgeable tokens in return, and receives credit 
for its cooperation by presenting these tokens to the 
central coordinator. The Antfarm token protocol forces 
each participant to divulge its upload contributions to the 
swarm coordinator, which enables the coordinator to de- 
termine swarm dynamics and allocate bandwidth to com- 
peting swarms. This enables the coordinator to exert con- 
trol while enabling peers to use microoptimizations, such 
as optimistic unchoking for peer discovery, tit-for-tat for 
peer selection, and rarest first, to improve the efficiency 
of swarming downloads. Overall, the Antfarm transport 
protocol makes the system resistant to attacks through 
unforgeable tokens, reveals a coarse-grain view of the 
network to the central coordinator, and enables the coor- 
dinator to adopt and enforce a chosen bandwidth alloca- 
tion strategy. 

The key contribution of this paper is the design of an 
efficient and scalable coordinator for multiple, concur- 
rent swarms. Given the internal dynamics of a set of 
swarms, we show how to optimize bandwidth among the 
swarms such that average download latencies are minimized
across all peers. If desired, the algorithm can guarantee
a minimum service level to certain swarms, avoid
starvation, and enforce prioritization among swarms. 
Minimizing the average download latency in turn enables 
a content distributor to achieve the best possible service 
from the available bandwidth. 

This paper makes two additional contributions for 
achieving high throughput in a practical multiple-swarm 
download service. First, the paper presents a wire-level 
protocol for accurately measuring the internal dynamics 
of individual swarms by making peer contributions evi- 
dent to the coordinator, enabling the coordinator to opti- 
mally allocate bandwidth among the competing swarms. 
Second, a full implementation of the protocol, accompa- 
nied by extensive simulations and a deployment on Plan- 
etLab, quantifies the performance of Antfarm against a 
client-server system and BitTorrent. In our experiments, 
Antfarm achieves aggregate bandwidths up to a factor of 
five higher than BitTorrent, and the protocol scales well 
with increasing peers and swarms. 

The rest of this paper is structured as follows. The 
next section describes the Antfarm system and the cen- 
tral optimization that underlies the approach. Section 3 
outlines the protocol that Antfarm uses for data distri- 
bution. Section 4 shows that the system achieves high 
performance. Section 5 describes related work and high- 
lights Antfarm’s differences, and Section 6 summarizes 
our contributions. 


2 APPROACH 


Antfarm is based on a hybrid peer-to-peer architecture 
that utilizes resources provided by peers according to 
an optimal strategy for managing multiple swarms com- 
puted by a coordinator. Each coordinator can manage 
multiple swarms, a single peer may participate in swarms 
managed by multiple coordinators, and coordinators may 
be physically replicated to scale with the number of peers 
and swarms. For simplicity, we assume a single coordi- 
nator in the following discussion and address the issue of 
scale in Section 3. 

The coordinator’s central task is to achieve the shortest 
possible download times across multiple swarms. Find- 
ing the right allotment of bandwidth among swarms is 
best viewed as a constrained optimization problem. The 
primary constraint is the available bandwidth at the seed- 
ers. The primary input to this optimization problem is the 
inherent response curve of each swarm. The response 
curve represents the swarm bandwidth as a function of 
allocated seeder bandwidth. It depends on the number 
of peers in the swarm, number of seeders, spare band- 
width on upload and download links of swarm partici- 
pants, and the distribution of unique blocks. Peers’ local 
decisions also influence their swarms’ response curves, 
as peers can advertise a lower upload bandwidth capacity
than they are capable of providing. However, the Ant- 
farm wire protocol, discussed in Section 3, encourages 
peers to cooperate within their swarms, granting the co- 
ordinator more available bandwidth to optimally allocate 
among all swarms in the system. 

Response curves embody the critical properties of 
each swarm and have a characteristic shape—a fact we 
exploit in this work. Figure 1 illustrates the characteristic
form of the response curve for a homogeneous swarm
with static membership; for illustration purposes in this 
example, peer download capacities exceed upload ca- 
pacities, and the set of peers does not change through- 
out the download. When the seeder bandwidth is small, 
the peers in the swarm have unused upload and down- 
load capacity. In this regime of operation (region A), 
the swarm’s aggregate bandwidth increases rapidly with 
the seeder bandwidth, since peers can use their spare up- 
load bandwidth to forward new blocks to other peers. 
Each individual block the seeders feed into the swarm 
will be shared among many peers, highly leveraging the 
bandwidth committed by the seeder. Once the peers 
in a homogeneous swarm have saturated their uplinks, 
the marginal benefit from additional seeder bandwidth 
drops significantly. In this regime (region B), any addi- 
tional bandwidth that a peer receives only benefits that 
peer, since saturated upload links render it unable to for- 
ward the data to other peers. Finally, once downlinks of 
swarm participants are saturated (region C), the swarm 
has reached its maximum aggregate bandwidth. Further 
bandwidth provided by the seeders will not impact down- 
load latency. If download capacities are lower than up- 
load capacities, region B will simply not exist, yielding a 
response curve with only two regions. 

A coordinator relies on two key properties of response 
curves to maximize the achieved aggregate swarm band- 
width while respecting the seeder bandwidth constraint. 
First, response curves are monotonic: a swarm’s aggre- 
gate bandwidth will never decrease as a result of increas- 
ing the seeder bandwidth to the swarm. Second, response 
curves are concave; that is, their derivatives monoton- 
ically decrease over possible seeder bandwidths. Con- 
cavity implies that a swarm’s aggregate bandwidth ex- 
hibits diminishing returns as the seeders increase their 
bandwidth to the swarm. When the seeders increase their 
bandwidth beyond a swarm-specific threshold, the peers’ 
uplinks and downlinks saturate, decreasing their ability 
to receive and forward data from the seeders and other 
peers. 
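
The two properties can be stated concretely for a sampled response curve. The sketch below assumes a curve is available as a list of (seeder bandwidth, aggregate swarm bandwidth) samples sorted by seeder bandwidth; this representation, the function names, and the sample values are illustrative assumptions, not Antfarm's internal format.

```python
def segment_slopes(points):
    """Slopes between consecutive (seeder_bw, aggregate_bw) samples.

    Assumes points are sorted by strictly increasing seeder_bw.
    """
    return [
        (y1 - y0) / (x1 - x0)
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    ]


def is_monotonic(points):
    """Aggregate bandwidth never decreases as seeder bandwidth grows."""
    return all(s >= 0 for s in segment_slopes(points))


def is_concave(points, tol=1e-9):
    """Slopes never increase from one segment to the next (diminishing returns)."""
    slopes = segment_slopes(points)
    return all(later <= earlier + tol for earlier, later in zip(slopes, slopes[1:]))


# Made-up samples shaped like regions A, B, and C of a homogeneous swarm.
example_curve = [(0, 0), (10, 50), (20, 60), (30, 60)]
```

Here `example_curve` has segment slopes 5, 1, and 0, so both checks pass; a curve whose slope rises from one segment to the next would fail `is_concave`.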

Figure 1: Response curves of a theoretical homogeneous
swarm and a measured heterogeneous swarm on PlanetLab.
Aggregate bandwidth increases rapidly as seeder bandwidth
increases (A) until peer uplink capacity is exhausted (B)
and reaches its maximum when downlinks are saturated (C).

Real-life swarms are more complex than the idealized
swarms discussed above in that they may comprise heterogeneous
hosts and exhibit peer churn. They nevertheless
exhibit several critical properties that Antfarm exploits.
In heterogeneous swarms, where peer uplinks and
downlinks are nonuniform, the transitions between the 
disparate regions of the response curves are smoother. 
This is because different peers’ upload and download ca- 
pacities saturate at different points, smoothing the dis- 
continuous transition seen in a homogeneous swarm. In 
addition to heterogeneity, real swarms exhibit peer churn, 
where peers can join at any time and leave due to failure, 
cancellation, or completion. Such membership changes 
shift the response curve because their influence affects 
the swarm’s dynamics, but do not violate the monotonic- 
ity and concavity properties outlined above. Section 3 
describes how Antfarm maintains an accurate view of 
the system and adjusts its behavior in the presence of dy- 
namic membership. 


The monotonicity and concavity of swarm response 
curves form the foundation of Antfarm’s multiple-swarm 
optimization. Intuitively, when a seeder is supporting a 
swarm that has a large number of saturated peers, such as 
in regions B or C in Figure 1, it should reduce its band- 
width to that swarm and divert it to a swarm whose peers 
can readily share additional bandwidth. More generally, 
given a response curve for each swarm Antfarm is cur- 
rently distributing, the coordinator “climbs” each of the 
curves, always preferring the steepest curve, until it has 
allocated all seeder bandwidth. The resulting point of 
operation on each curve represents the amount of band- 
width the seeders plan to feed to each swarm and the ex- 
pected aggregate bandwidth within each swarm based on 
the seeder bandwidth. Given each swarm’s measured re- 
sponse curve, this allocation of seeder bandwidth is opti- 
mal [40]: decreasing the seeder bandwidth to one swarm 
in favor of another will not improve the overall performance
of the system. Antfarm’s allocation of seeder
bandwidth ensures that the content distributor achieves
the highest performance possible from its servers’ bandwidth.

Figure 2: Optimal bandwidth allocation for three concurrent
swarms. The Antfarm coordinator awards seeder bandwidth
by hill-climbing the steepest response curves first until
all available bandwidth has been allocated.
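
The "climb the steepest curve first" procedure can be sketched as a greedy loop over piecewise-linear concave curves. This is a minimal illustration under our own assumptions, not the coordinator's actual code: each curve is modeled as a list of `(width, slope)` segments with non-increasing slopes, the function name and sample numbers are hypothetical, and ties between equal slopes are broken by swarm name here rather than by Antfarm's tie-breaker.

```python
import heapq


def allocate_seeder_bandwidth(curves, budget):
    """Greedily hand out seeder bandwidth to the steepest response curve.

    curves: dict mapping swarm name -> list of (width, slope) segments,
            with slopes non-increasing along each list (concavity).
    budget: total seeder bandwidth available.
    Returns a dict mapping swarm name -> allocated seeder bandwidth.
    """
    allocation = {name: 0.0 for name in curves}
    # Max-heap on slope (negated), holding each swarm's next unclimbed segment.
    heap = [(-segs[0][1], name, 0) for name, segs in curves.items() if segs]
    heapq.heapify(heap)

    remaining = budget
    while heap and remaining > 0:
        neg_slope, name, idx = heapq.heappop(heap)
        if -neg_slope <= 0:
            break  # no swarm benefits from more seeder bandwidth
        width = curves[name][idx][0]
        grant = min(width, remaining)
        allocation[name] += grant
        remaining -= grant
        if idx + 1 < len(curves[name]):
            heapq.heappush(heap, (-curves[name][idx + 1][1], name, idx + 1))
    return allocation


# Made-up curves for three swarms; units are arbitrary.
example_curves = {
    "A": [(10, 5.0), (10, 1.0), (10, 0.0)],
    "B": [(5, 8.0), (20, 2.0)],
    "C": [(10, 3.0), (10, 0.5)],
}
```

With a budget of 30, the loop climbs B's first segment (slope 8), then A's (slope 5), then C's (slope 3), and spends the remaining 5 units on B's second segment (slope 2), giving each swarm 10 units. Because every curve is concave, the next segment of a swarm is never steeper than the one just climbed, which is what makes the greedy order safe.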

The optimization process described above may reach 
a point at which the seeders have excess bandwidth to 
award, yet the derivatives of multiple response curves 
are identical, indicating that multiple swarms offer the 
same global benefit (Figure 2). In such cases of equiva- 
lent global benefit, Antfarm uses a tie-breaker algorithm 
to maximize perceived improvement by peers. Suppose 
that two swarms $t_1$ and $t_2$ have response curves with
equivalent slopes at seeder bandwidths $s_1$ and $s_2$, corresponding
to swarm aggregate bandwidths of $a_1$ and $a_2$,
with $a_1 > a_2$. While this indicates that awarding a block
to either swarm would improve average download times
across the entire network by an equal amount, the incremental
benefit to members of $t_1$, which already enjoy
a larger aggregate throughput, is small compared to
the relative improvement that members of $t_2$ would perceive.
Consequently, Antfarm breaks ties by awarding
bandwidth to swarms with lower bandwidth when mul- 
tiple response curves have the same slope. This mecha- 
nism ensures that the system maintains its primary goal 
of minimizing download time, while the participants re- 
ceive maximal marginal benefit whenever there is free- 
dom in making a bandwidth allocation that is in line with 
the primary goal. 
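
The tie-breaking rule can be illustrated with a small selection function. This is a sketch under our own assumptions: each swarm is summarized by the current slope of its response curve and its current aggregate bandwidth, and the function name, the tolerance `eps`, and the sample values are hypothetical.

```python
def pick_swarm(swarms, eps=1e-9):
    """Choose the swarm that should receive the next unit of seeder bandwidth.

    swarms: dict mapping swarm name -> (slope, aggregate_bw), where slope is
            the current derivative of the swarm's response curve.
    The steepest slope wins; among swarms whose slopes are equal (within eps),
    the one with the lowest aggregate bandwidth is preferred.
    """
    best_slope = max(slope for slope, _ in swarms.values())
    tied = [
        name for name, (slope, _) in swarms.items()
        if abs(slope - best_slope) <= eps
    ]
    return min(tied, key=lambda name: swarms[name][1])


# Made-up values: t1 and t2 have equal slopes, but t1 already has more bandwidth.
example_swarms = {
    "t1": (2.0, 900.0),
    "t2": (2.0, 300.0),
    "t3": (1.0, 100.0),
}
```

In this example `pick_swarm(example_swarms)` selects `t2`: the same absolute gain is a larger relative improvement for a swarm at 300 than for one at 900, while `t3` is skipped because its slope is lower.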


3 IMPLEMENTATION 


The Antfarm implementation is centered around a token- 
based wire protocol that implicitly reveals peer dynam- 
ics to the coordinator. This section provides an overview 
of the Antfarm implementation, outlines the wire protocol
and the use of tokens, and describes how tokens
are used in the larger context of bandwidth allocation.
We illustrate the common case first and treat the corner
cases stemming from token misuse, peer misbehavior,
and overall scalability in Sections 3.4 and 3.5.


NSDI ’09: 6th USENIX Symposium on Networked Systems Design and Implementation


3.1 Overview 


An Antfarm deployment consists of two types of servers 
provided by the content provider. Coordinators man- 
age the system by issuing tokens, computing response 
curves, and determining bandwidth allocations. Seed- 
ers expend their bandwidth to distribute blocks of files 
to peers. For small deployments, a single server machine 
can act as both coordinator and seeder, while large de- 
ployments will comprise multiple physical hosts. 

Antfarm seeders are members of all swarms and dis- 
tribute data blocks without downloading any themselves. 
They may be under the direct administrative control of 
the coordinator, or they may be deployed by ISPs to re- 
duce their ingress bandwidth demand; in either case, they 
may be geographically distributed to improve bandwidth 
to peers. Seeders do not demand tokens from peers in 
exchange for blocks because they do not place resource 
demands on the system. 

Peers interact with coordinators, seeders, and each 
other to download files. Each peer in Antfarm is identi- 
fied by a certificate acquired from the coordinator during 
an initial, one-time registration. Once a connection with 
a peer has been established and the peer has been au- 
thenticated with the coordinator, wire messages identify 
peers using a public IP address and port pair that is short- 
hand for the verified certificate. Antfarm assumes that 
peers are either rational, where the protocol will incen- 
tivize them to contribute resources to the global pool, or 
malicious, where they may behave in a Byzantine man- 
ner; the protocol is resilient to such malicious hosts (see 
Section 3.5). 

The Antfarm wire protocol is designed around peer- 
to-peer data exchange in return for tokens. A token is 
a cheap, unforgeable capability that the bearer may ex- 
change for a data block in a given swarm. Logically, 
a token is composed of a unique, randomly generated ID 
string, an expiration time after which the token is invalid, 
a reference to the intended spender of the token, and a 
reference to the file for which the token should be spent. 
The coordinator records these four fields when it mints 
a new token for a particular peer. A token can only be 
spent by the peer to which it was issued in exchange for 
blocks of the designated file; tokens are not interchange- 
able between swarms. 
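
The four logical fields of a token can be pictured with a small sketch. The names `Token`, `mint_token`, and `may_spend`, the use of `secrets.token_hex` for the random ID, and the 300-second default lifetime are illustrative assumptions, not Antfarm's actual data structures.

```python
import secrets
import time
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    token_id: str      # unique, randomly generated ID
    expires_at: float  # time after which the token is invalid (seconds)
    spender: str       # peer the token was issued to
    file_ref: str      # file (swarm) in which the token may be spent


def mint_token(spender, file_ref, lifetime=300.0, now=None):
    """Coordinator side: create a token bound to one spender and one file."""
    now = time.time() if now is None else now
    return Token(secrets.token_hex(4), now + lifetime, spender, file_ref)


def may_spend(token, peer, file_ref, now=None):
    """A token is usable only by its intended spender, for its file, before expiry."""
    now = time.time() if now is None else now
    return (token.spender == peer
            and token.file_ref == file_ref
            and now < token.expires_at)
```

Binding the spender and the file into the token is what makes tokens non-transferable between peers and between swarms in this sketch.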

Downloadable files in Antfarm are described by a 
“.ant” swarm description, analogous to a “.torrent” file,
which contains the name of the file, the address and port
of the coordinator managing the swarm, data block size,
and a hash of each data block. 


3.2 A Peer’s Perspective 


An Antfarm peer joins a swarm by opening a connection 
to the swarm’s coordinator and authenticating itself us- 
ing its peer certificate. Once a connection has been estab- 
lished, all correspondence with the coordinator and peers 
occurs with the exchange of protocol messages summa- 
rized in Table 1. When a new peer joins a swarm, the 
coordinator sends the peer a subset of the peers in the 
swarm and an initial allowance of tokens unless the new 
peer has a history of malicious behavior. The peer can 
similarly join additional swarms, acquiring peer lists and 
initial tokens for each. 

The basic data transmission protocol in Antfarm has 
three phases consisting of peer and block selection, data- 
for-token exchange, and bandwidth allocation. 

Peers may determine their own criteria for selecting 
peers and blocks. This enables Antfarm peers to per- 
form optimizations based on local information, reduc- 
ing the burden on the centralized coordinator. The de- 
fault behavior in Antfarm for peer and block selection is 
identical to BitTorrent. Peers retain a prioritized list of 
other peers with which to exchange data blocks (to un- 
choke). The priority order is determined by the running 
average bandwidth achieved through that peer’s history 
of interactions. Blocks are chosen using a rarest-first al- 
gorithm; peers maintain a bitmap of blocks held by each 
connected peer constructed from block acquisition noti- 
fications sent by peers after each block transfer. Since 
swarming systems that rely solely on local information 
and randomized interactions may operate at reduced ef- 
ficiency due to lack of information [30], the Antfarm 
coordinator uses its global knowledge to influence peer 
selection. The coordinator monitors each peer’s upload 
history and identifies underutilized peers. It sends lists 
of such peers as candidates for data exchange through an 
asynchronous notification. This is an advisory notifica- 
tion that causes the recipient to increase the priority of 
the named, underutilized peers. This is a no-cost opti- 
mization for Antfarm; a peer is under no obligation to 
follow the recommendations and the protocol’s correct- 
ness does not depend on the peer-selection algorithm. 
This process of aiding peer selection could be improved 
by the use of network proximity measures [19, 33, 41], 
though our current implementation does not yet include 
this optimization. 
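
The default rarest-first block choice described above can be sketched as follows. The function name, the use of Python sets in place of bitmaps, and the deterministic tie-break by lowest block index are assumptions made for illustration; a real client would typically break ties randomly.

```python
from collections import Counter


def rarest_first(my_blocks, peer_bitmaps, sender):
    """Pick the block held by `sender` that we lack and that the fewest
    connected peers hold.

    my_blocks: set of block indices we already have.
    peer_bitmaps: dict mapping peer name -> set of block indices it holds.
    Returns a block index, or None if `sender` has nothing we need.
    """
    # Count how many connected peers hold each block.
    availability = Counter()
    for blocks in peer_bitmaps.values():
        availability.update(blocks)

    candidates = peer_bitmaps[sender] - my_blocks
    if not candidates:
        return None
    # Rarest block first; ties go to the lowest index to stay deterministic.
    return min(candidates, key=lambda b: (availability[b], b))
```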

Once a peer (receiver) has chosen another peer
(sender) and determined a suitable block for download,
it sends a data-exchange request. If the sender has un-
choked the receiver, it sends the requested block to the re-
ceiver. Upon completion of the transfer, a non-malicious
receiver checks the hash of the block against the hash
specified in the swarm description and sends an unex-
pired token to the sender of the data block. Each peer
maintains a purse of unused tokens issued by the coordi-
nator for use by that peer, and a ledger of tokens received
from other peers in exchange for data blocks. Tokens
flow from the purse of the receiver to the ledger of the
sender.


Connections
  handshake           Sent by peers to establish connections; includes the
                      identifier of a file the sender wants to download and
                      the public port of the sender.
  handshake_response  Sent in response to a handshake.
  join_swarm          Sent to the coordinator to become a swarm member.
  leave_swarm         Sent to the coordinator to be removed from a swarm.
  time_request        Sent by a peer to the coordinator to get the system time.
  time_response       Sent in response to a time_request; contains the time
                      according to the coordinator.

Node state
  choke               Informs the recipient that the sender is not accepting
                      block requests from the recipient.
  unchoke             Informs the recipient that the sender is now accepting
                      block requests from the recipient.
  interested          Informs the recipient that it has at least one block
                      that the sender needs.
  not_interested      Informs the recipient that the recipient does not have
                      any blocks that the sender needs.
  have_block          A notification sent to directly connected peers when a
                      peer receives a new block.
  bitfield            Contains a bitfield of all the blocks the sender
                      possesses. Normally sent after establishing a new
                      connection.

Block transfers
  request             A request for a specific block.
  block               A block of file data, sent in response to a request.

Swarm info
  peer_request        Sent by a peer to the coordinator to request a set of
                      peers in the swarm.
  peer_response       A set of peers’ addresses and ports.
  good_peers          Sent periodically by the coordinator to each peer to
                      notify them of peers to unchoke.
  bad_peers           A notification containing a set of peers the
                      coordinator has identified as malicious.
  allocation          Sent by the coordinator to inform peers of the desired
                      allocation of their upload bandwidth.

Token management
  new_tokens          Sent by the coordinator to deliver a set of fresh
                      tokens to a peer.
  token_receipt       Receipt for a block transfer; sent from one peer to
                      another in response to a block message.
  token_ledger        Contains a set of spent tokens sent to the coordinator
                      in exchange for fresh tokens.
  token_replace       Contains a set of fresh tokens sent to the coordinator
                      in exchange for new tokens with later expiration times.

Table 1: Antfarm wire protocol. A comprehensive list
of peer-peer and coordinator-peer messages. The protocol
comprises messages to establish connections, notify peers of
progress and status, exchange blocks, and handle tokens.

Peers communicate periodically with the coordinator 
to refresh their purses and ledgers. Each unexpired to- 
ken in the ledger entitles the peer to a fresh token for its 
purse. This communication takes place every minute in 
the current implementation. If a newly received token in 
the ledger is going to expire before the next scheduled re- 
fresh, or if the purse contains nearly expired unspent to- 
kens, the peer can preemptively redeem selected tokens 
for new tokens with later expiration times. 
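
The purse, the ledger, and the periodic refresh can be sketched together. Everything named here (`Peer`, `fresh_token`, the tuple representation of a token, and the local stand-in for the coordinator) is a simplifying assumption; in particular, the real exchange goes over the wire and is verified by the coordinator.

```python
import itertools

TOKEN_LIFETIME = 300.0  # seconds; assumed to match a 5-minute expiration
_ids = itertools.count(1)


def fresh_token(now):
    """Hypothetical stand-in for the coordinator minting a token: (id, expiry)."""
    return (next(_ids), now + TOKEN_LIFETIME)


class Peer:
    def __init__(self):
        self.purse = []   # unused tokens issued to this peer
        self.ledger = []  # tokens received from other peers for blocks

    def pay(self, sender, now):
        """After verifying a block, move one unexpired token to the sender's ledger."""
        for i, (_, expires_at) in enumerate(self.purse):
            if expires_at > now:
                sender.ledger.append(self.purse.pop(i))
                return True
        return False  # no unexpired token available

    def refresh(self, now):
        """Periodic exchange with the coordinator: each unexpired ledger token
        earns one fresh token for the purse; the ledger is then emptied."""
        earned = sum(1 for _, expires_at in self.ledger if expires_at > now)
        self.ledger.clear()
        self.purse.extend(fresh_token(now) for _ in range(earned))
        return earned
```

In this sketch a token that expires while sitting in the ledger earns nothing, which is the reason a peer would redeem nearly expired tokens early.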

Peers following the above protocol will face a stream 
of competing requests for data blocks. Peers use a leaky- 
bucket algorithm to restrict upload bandwidth according 
to the coordinator-prescribed allocation. Altruistic peers 
that finish downloading a file may remain in the swarm 
and continue to upload content, functioning similarly to 
seeders. 
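
A leaky bucket that caps uploads at an allocated rate might look like the sketch below. The class name, the explicit `now` parameter, and the choice of a burst `capacity` are assumptions; the text does not specify how Antfarm parameterizes its buckets.

```python
class LeakyBucket:
    """Leaky bucket used as a meter: the level drains at `rate` bytes/second,
    and a send is admitted only if it fits under `capacity` bytes."""

    def __init__(self, rate, capacity):
        self.rate = rate          # allocated upload rate, bytes per second
        self.capacity = capacity  # burst allowance, bytes
        self.level = 0.0
        self.last = 0.0

    def try_send(self, nbytes, now):
        # Drain the bucket for the time elapsed since the last call.
        self.level = max(0.0, self.level - (now - self.last) * self.rate)
        self.last = now
        if self.level + nbytes > self.capacity:
            return False  # sending now would exceed the allocation
        self.level += nbytes
        return True
```

A peer could keep one such bucket per swarm and set each `rate` from the coordinator's allocation message.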


3.3 The Coordinator’s Perspective


The coordinator collects statistics on peer network be-
havior, computes response curves and bandwidth alloca-
tions for each peer and seeder, and steers the swarm to-
ward an efficient operating point. It achieves these through
manipulation of the token supply and direct interaction 
with cooperative peers. Finally, it keeps track of mali- 
cious and uncooperative participants, excising them from 
the network when their misbehavior affects performance. 
The primary task of the coordinator is to monitor 
network characteristics and swarm dynamics by keep- 
ing track of tokens for each data block transaction be- 
tween peers. Each token the coordinator receives informs 
the coordinator of the swarm in which a transaction oc- 
curred, the specific peers involved in the transaction, and 
a window of time in which the data block was transferred 
based on the token’s minting and expiration times. This 
information is sufficient to maintain two key parameters 
for each peer $p$: the set of swarms $T_p$ that $p$ is a mem-
ber of and a rolling average of its upload bandwidth $u_p$.
In addition, the coordinator keeps track of the set of all
seeders $S$ and two pieces of state for each swarm: a set
$P_t$ of peers in swarm $t$ and a response scatterplot for each
swarm, represented as a collection of data points with
associated time-decaying weights. Data points decay ac-
cording to $1/t$ and are removed after 30 minutes.

Figure 3: Bandwidth allocation. The black dots denote the
allocation of bandwidth for swarm $t$ before and after one it-
eration of allocation. For each $\Delta\sigma$ tasked to a seeder by the
hill-climbing algorithm, a randomly selected peer with spare
upload capacity is tasked with allocating a corresponding $\Delta\delta$.
The dotted line has a slope of 1, accounting for the seeders’
contribution to the swarm’s aggregate bandwidth.
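
The time-decayed response scatterplot described above can be sketched as a filter over timestamped measurements. The exact decay function is not spelled out in the text, so the weight `1 / (1 + age in minutes)` used here is an assumed concrete form of a 1/t decay, and the function and tuple layout are hypothetical.

```python
MAX_AGE = 30 * 60  # seconds; points older than 30 minutes are dropped


def weighted_points(points, now):
    """points: list of (timestamp, seeder_bw, aggregate_bw) measurements.

    Returns (seeder_bw, aggregate_bw, weight) for points still in the window.
    The +1 in the weight avoids division by zero for brand-new points.
    """
    result = []
    for ts, seeder_bw, aggregate_bw in points:
        age = now - ts
        if 0 <= age <= MAX_AGE:
            weight = 1.0 / (1.0 + age / 60.0)
            result.append((seeder_bw, aggregate_bw, weight))
    return result
```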


The coordinator chooses swarms to grant bandwidth 
based on collected swarm statistics. The response scat- 
terplots are not immediately suitable for use in comput- 
ing bandwidth allocation, as they contain artifacts due 
to measurement errors and changes over time, creating 
false local minima and maxima. The coordinator gen- 
erates a response curve from a response scatterplot by 
fitting a piecewise linear function that respects the mono- 
tonicity and concavity constraints, contains a segment 
for each measurement point, and minimizes error using 
least-squares. 


The coordinator computes the amount of bandwidth
each seeder and peer should dedicate to each swarm
based on the computed response curves, represented as
two matrices $\sigma$ and $\delta$. For each swarm $t$, $\sigma_{s,t}$ captures
the amount of bandwidth seeder $s$ will dedicate to $t$, and
$\delta_{p,t}$ captures the amount of bandwidth peer $p$ is expected
to dedicate to $t$. This determines the critical allocation of
seeder upload bandwidth $\sigma_t = \sum_{s \in S} \sigma_{s,t}$ to swarm $t$ in
order to achieve a swarm aggregate bandwidth $(\sigma_t + \delta_t)$,
where $\delta_t = \sum_{p \in P_t} \delta_{p,t}$ is the bandwidth component re-
sulting from peer-to-peer uploads. The coordinator com-
putes this allocation periodically, every 5 minutes in our
current implementation, and also when the area under
the curve has changed by more than 10%. In comput-
ing $\sigma$ and $\delta$ the coordinator operates under two hard con-
straints. First, $\delta_p = \sum_{t \in T_p} \delta_{p,t}$ can never exceed $p$’s
upload capacity $u_p$. Second, the node must have the file
to seed; a peer will never be tasked to upload blocks of a
file it is not interested in downloading. The coordinator
determines $\sigma$ and $\delta$ iteratively. Initially, $\sigma_{s,t} = \delta_{p,t} = 0$
for all peers $p$, seeders $s$, and swarms $t$. The coordi-
nator determines the allocation of bandwidth through a
greedy hill-climbing algorithm using the computed re-
sponse curves and its knowledge of the seeders’ upload
capacities, illustrated in Figure 3. It allocates bandwidth
in discrete units to the swarms whose response curves
have the highest gradient, breaking ties in favor of the
swarm with the lower value of $(\sigma_t + \delta_t)$, as described in
Section 2. For each increase in seeder bandwidth $\Delta\sigma_t$ to
swarm $t$, the algorithm chooses a peer at random from
$P_t$ with spare upload bandwidth and tasks it with up-
loading an additional $\Delta\delta_t$ to $t$, as prescribed by $t$’s re-
sponse curve. The coordinator continues the process un-
til all seeder bandwidth has been allocated. The final
peer allocation $\delta$ satisfies the two critical constraints de-
scribed above and ensures that peer transfers within each
swarm achieve the previously measured aggregate band-
width based on the seeders’ allocation $\sigma$.
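
The seeder side of the greedy hill-climbing step can be sketched as below. This is a minimal illustration under stated assumptions: response curves are given as piecewise-linear functions built by the hypothetical helper `piecewise_linear`, all seeder capacity is pooled into one number, and the matching assignment of upload work to randomly chosen peers is omitted.

```python
def piecewise_linear(points):
    """Build f(x) from sorted (x, y) breakpoints; flat beyond the last point."""
    def f(x):
        if x <= points[0][0]:
            return points[0][1]
        for (x0, y0), (x1, y1) in zip(points, points[1:]):
            if x <= x1:
                return y0 + (y1 - y0) * (x - x0) / (x1 - x0)
        return points[-1][1]
    return f


def allocate_seeder_bandwidth(curves, capacity, unit):
    """Greedy hill climbing over response curves.

    curves: dict swarm -> f(seeder_bw) giving the swarm's aggregate bandwidth.
    capacity: total seeder upload bandwidth; unit: size of one allocation step.
    Returns dict swarm -> allocated seeder bandwidth.
    """
    seeder_bw = {t: 0 for t in curves}

    def gain(t):
        # Increase in aggregate bandwidth from one more unit of seeder bandwidth.
        return curves[t](seeder_bw[t] + unit) - curves[t](seeder_bw[t])

    remaining = capacity
    while remaining >= unit:
        # Highest gradient first; ties go to the swarm with lower aggregate bandwidth.
        best = max(curves, key=lambda t: (gain(t), -curves[t](seeder_bw[t])))
        seeder_bw[best] += unit
        remaining -= unit
    return seeder_bw
```

With two swarms whose curves have equal slopes, the tie-break sends the first unit to the swarm with the lower aggregate bandwidth, as in the usage below.

```python
curves = {
    "A": piecewise_linear([(0, 100), (10, 150), (20, 160)]),
    "B": piecewise_linear([(0, 0), (10, 50), (20, 60)]),
}
result = allocate_seeder_bandwidth(curves, capacity=30, unit=10)
```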

Computation of bandwidth allocation is not a highly 
time-critical task. Delays in network measurements and 
peer interactions imply inherent delays between comput- 
ing an allocation and seeing a change in the network. 
Since the latency of computing the bandwidth allocation 
is dwarfed by the latency of data exchange, the computa-
tion can be performed in the background. The optimiza-
tion algorithm is linear in the number of peers and grows
according to $O(n \lg n)$ with the number of swarms $n$, en-
abling the system to scale. The primary metric that deter-
mines the quality of the solution is the freshness of data
on swarm dynamics. 

Antfarm’s token protocol incentivizes peers to report
statistics to the coordinator in a timely manner. A to-
ken’s expiration time (5 minutes in the current imple-
mentation) and spender-specificity force peers to return
tokens to the coordinator in order to receive bandwidth 
in the future. The circulation of tokens reveals enough 
information to the coordinator to perform the allocation 
described above. 

Token-based economies can suffer from inflation, de- 
flation, and bankruptcy if left unmonitored. Based on 
analyses of scrip systems [32], the Antfarm coordina- 
tor maintains a constant number of tokens per swarm per 
peer (30 in the current implementation). New peers re- 
ceive an initial allowance of 30 tokens. As unspent to- 
kens expire, the coordinator redistributes an equal num- 
ber of new tokens to random peers to prevent a token 
deficit when peers depart with positive token balances. 
Token unforgeability prohibits inflation, and token redis-
tribution enables bankrupt peers to acquire new blocks
and reintegrate themselves into the swarm. 

The coordinator rewards peers that contribute to the 
system both directly, by offering seeder bandwidth to 
peers that have donated bandwidth to peers, and indi- 
rectly, by suggesting which peers are underutilized. The 
latter partly influences unchoking decisions as described
previously. The coordinator determines this list for each
peer by selecting a small subset of the top uploaders to 
that swarm, chosen randomly from a probability distri- 
bution determined by upload bandwidth. 

Peer churn and changes in network conditions cause 
response curves to become stale over time. In addi- 
tion, transient measurement errors can skew response 
curves, causing the system to operate suboptimally. Ant- 
farm maintains response curves by actively exploring 
the swarm’s response at different seeder bandwidths. In 
each epoch, the coordinator randomly perturbs the cur- 
rent bandwidth allocation by a small amount for each 
swarm, on the order of 5 KB/s (kilobytes per second). 
Such variances provide additional datapoints for the re- 
sponse scatterplot, enabling the system to overcome false 
local minima due to transient effects. 

The coordinator does not enforce peers’ compliance 
with the coordinator’s directives in allocating their up- 
load bandwidth. A peer is free to shift bandwidth away 
from one swarm in favor of another at its discretion. In 
such a scenario, the coordinator will simply observe a 
shift in the swarms’ dynamics, which will be reflected in 
the response curves. In the next epoch, the coordinator 
will perform a new bandwidth allocation that takes the 
peer’s behavior into account. 


3.4 Scalability 


The Antfarm coordinator is optimized to ensure that the 
logical centralization does not pose a CPU or bandwidth 
scalability bottleneck. 

Shuttling tokens to and from the coordinator for each 
data block transaction is the main source of coordinator 
bandwidth expenditure. To reduce the burden, Antfarm 
does not rely on public-key cryptography to issue or ex- 
change tokens. The Antfarm protocol minimizes the size 
of tokens on the wire, transmitting only relevant fields 
when tokens change hands. Only a token’s ID, file refer- 
ence, and expiration time are sent on the wire when the 
coordinator sends fresh tokens, and only the ID and expi- 
ration time are sent on the wire when a peer sends another 
peer a token. Spent tokens sent back to the coordinator 
are represented with only the token’s ID and the identifier 
of the peer that spent the token. Using 4-byte token IDs, 
each token exchange requires less than 24 bytes of to- 
tal bandwidth and less than 16 bytes of bandwidth at the 
coordinator for each data block of around 32-128 KB. 

Antfarm uses highly compact versions of token iden- 
tifiers to reduce bandwidth. A 4-byte ID is sufficient to 
disincentivize forgery because the coordinator will detect 
a malicious peer’s attempt to forge a token upon its first 
failure to produce a legitimate token. In the event that a 
peer correctly guesses an active token’s ID, it is unlikely
to correctly identify the token’s intended spender. In the
worst case, should a peer successfully forge a token, it
will only gain one data block for its efforts, whereas fail-
ures will lead to remedial action against the peer, de-
scribed in Section 3.5. Thus, with 4-byte token IDs, sev-
eral million peers, and several hundred million tokens,
the likelihood of a successful, undetected token forgery
is around $10^{-8}$ when tokens are uniformly distributed.
With a skewed token distribution where some peers have
100 times more tokens than the average peer, the like-
lihood might rise as high as $10^{-6}$. Downloading ten
blocks with forged tokens is as likely as discovering a 
collision for a cryptographically secure hash function. 
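
The order of magnitude of the forgery likelihood can be checked with a back-of-the-envelope calculation. The concrete values below (300 million active tokens, 3 million peers) are assumed stand-ins for "several hundred million" and "several million", and the model (a guessed ID must hit an active token, and the forger must also name its intended spender) is a simplification.

```python
ID_SPACE = 2 ** 32      # number of distinct 4-byte token IDs
ACTIVE_TOKENS = 300e6   # assumed value for "several hundred million tokens"
PEERS = 3e6             # assumed value for "several million peers"


def forgery_probability(skew=1.0):
    """Chance that one random guess names an active token ID and its spender.

    skew: how many times more tokens the guessed spender holds than average.
    """
    p_active_id = ACTIVE_TOKENS / ID_SPACE
    p_right_spender = skew / PEERS
    return p_active_id * p_right_spender
```

Under these assumptions a single guess succeeds with probability of roughly 2 in 100 million, and a spender holding 100 times the average number of tokens raises that by two orders of magnitude.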


The Antfarm coordinator expends its bandwidth to 
send tokens to peers, receive spent tokens back from 
peers, and periodically send swarm allocations and lists 
of top contributors to peers and seeders. To alleviate the 
bandwidth demands placed on the coordinator, the Ant- 
farm protocol enables the coordinator to be distributed 
hierarchically. A lead coordinator machine handles com- 
puting response curves and determining swarm band- 
width allocations. The remaining coordinators, called 
token coordinators, issue tokens, collect tokens back 
from peers, and send each peer’s upload and
download rates to the lead coordinator each time the lead 
coordinator computes bandwidth allocations. The lead 
coordinator redirects each peer to a token coordinator 
based on a hash of the peer’s IP address. When a token 
coordinator receives a spent token from an assigned peer, 
it applies the same hash function to the IP address of the 
token’s original owner, a field in the token itself, so it can 
verify the token with the token coordinator that issued it. 
Thus, each token exchanged between peers involves at 
most two token coordinators. 
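
The hash-based assignment of peers to token coordinators can be sketched as follows. The function names and the choice of SHA-256 are assumptions; the text only states that a hash of the peer's IP address is used.

```python
import hashlib


def token_coordinator_for(ip, num_coordinators):
    """Map a peer's IP address to a token coordinator index by hashing it."""
    digest = hashlib.sha256(ip.encode("ascii")).digest()
    return int.from_bytes(digest[:8], "big") % num_coordinators


def coordinators_involved(receiver_ip, spender_ip, num_coordinators):
    """Coordinators touched when `receiver_ip` redeems a token spent by `spender_ip`."""
    return {
        token_coordinator_for(receiver_ip, num_coordinators),  # receives the spent token
        token_coordinator_for(spender_ip, num_coordinators),   # issued it, so verifies it
    }
```

Because the mapping is a pure function of the IP address, any token coordinator can locate the issuer of a token without consulting the lead coordinator.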


Token coordination is an embarrassingly parallel task. 
The high ratio of data block length to token size
ensures that the coordinator bandwidth is leveraged sev- 
eral thousand-fold. Section 4 shows that distributing the 
coordinator incurs negligible overhead and that the par- 
allel nature of token management enables the system to 
grow linearly with the number of coordinator machines. 


The coordinator performs two periodic CPU-bound 
tasks: it computes response curves from scatterplots and
allocates seeders’ and peers’ bandwidth. These tasks are 
computed centrally in order to derive bandwidth alloca- 
tions based on the most recent measurements. Our cur- 
rent implementation on a 2.2 GHz CPU with 3 GB of 
memory takes 6 seconds to perform these computations 
for 1,000,000 peers and 10,000 swarms whose populari- 
ties follow a realistic Zipf distribution. The lead coordi- 
nator can easily be replicated to mask network and host 
failures.


3.5 Security


A formal treatment of the security properties of the un- 
derlying Antfarm wire protocol is beyond the scope of 
this paper. Past work on similar, though heavier-weight, 
protocols [52] has established the feasibility of a secure 
wire protocol. Consequently, the focus of this section is 
to enunciate our assumptions, describe the overall goals 
of the protocol, provide design alternatives, and outline 
how to mitigate attacks targeting the bandwidth alloca- 
tion algorithm. 


Antfarm makes standard cryptographic assumptions 
on the difficulty of reversing one-way hashes and as- 
sumes that peers cannot snoop on or impersonate other 
peers at the IP level. Violation of the first assumption 
would render the Antfarm wire protocol, as well as most 
cryptographic algorithms, insecure; consequently, much 
effort has gone into the design of secure hash functions. 
Violation of the second assumption is unlikely without 
ISP collusion, and damage is limited to IP addresses that 
an attacker can successfully snoop and masquerade. 


Antfarm requires peers to contribute bandwidth to 
their swarms, engage in legitimate token-for-block trans- 
actions with other peers, and report accurate statistics to 
the coordinator. The token protocol, coupled with verifi- 
cation at the coordinator, ensures detection of dishonest 
peers with relatively low overhead. 


In order to measure accurate response curves, the co- 
ordinator verifies that all token transactions occur within 
the intended swarm, by the intended peer, and within the 
intended period of time. The coordinator detects token 
forgery upon receiving an invalid token from a peer by 
simply comparing the token ID against its own registry 
of active tokens. Similarly, the coordinator compares its 
own record of the intended spender with the spender as
reported by the peer returning the token to prevent peers 
from spending maliciously obtained tokens. Peers are 
required to report the actual spender in order to receive 
a fresh replacement token. The coordinator detects all 
counterfeit tokens, but when it detects an invalid token, 
it is unable to differentiate the peer sending the token 
from its ledger from the peer that originally spent the to- 
ken as the culprit. Therefore, it notifies both peers of the 
forgery so the honest peer can blacklist the culprit. 
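
The coordinator's checks on a returned token can be sketched as a lookup against its registry of active tokens. The function name, the registry layout, the returned reason strings, and the removal of a token after a successful redemption are assumptions made for illustration.

```python
def verify_spent_token(registry, token_id, reported_spender, swarm, now):
    """Check a returned token against the coordinator's registry of active tokens.

    registry: dict token_id -> (intended_spender, swarm, expires_at).
    Returns "ok" or a short reason string for the rejection.
    """
    record = registry.get(token_id)
    if record is None:
        return "unknown token"   # forged ID, or a token already redeemed
    intended_spender, token_swarm, expires_at = record
    if reported_spender != intended_spender:
        return "wrong spender"   # not spent by the peer it was issued to
    if swarm != token_swarm:
        return "wrong swarm"     # tokens are not interchangeable between swarms
    if now > expires_at:
        return "expired"
    del registry[token_id]       # assumed: a token is redeemable only once
    return "ok"
```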


To hold peers more accountable for their actions when 
the coordinator is unable to precisely identify malicious 
peers, Antfarm peers employ a strikes system to record 
and act on undesirable behavior. Peers maintain a tally 
of strikes against other peers and disconnect from peers 
that have exceeded a threshold. By default, misbehaviors 
that can stem from network congestion, such as a late 
response to a block request or payment with a recently 
expired token, result in one strike against the offending
peer. Circulating a counterfeit token results in automatic
termination of the connection. In general, when the co- 
ordinator is unable to determine the identity of a mali- 
cious peer, it appeals to the strikes system rather than 
erroneously penalizing an honest peer. While it is pos- 
sible to build a centralized reputation system for peers, 
the current Antfarm implementation avoids this to reduce 
burden on the coordinator. 
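
A peer-local strikes tally might be organized as in the sketch below. The class name, the offense labels, and the threshold value are hypothetical; the text does not state the default threshold.

```python
from collections import defaultdict

STRIKE_LIMIT = 3  # assumed threshold; the text does not give a value


class StrikeBook:
    def __init__(self, limit=STRIKE_LIMIT):
        self.limit = limit
        self.strikes = defaultdict(int)
        self.disconnected = set()

    def record(self, peer, offense):
        """Record an offense and return True if `peer` should now be disconnected."""
        if offense == "counterfeit_token":
            self.disconnected.add(peer)  # automatic termination
        else:
            # e.g. "late_response" or "expired_token": may stem from congestion
            self.strikes[peer] += 1
            if self.strikes[peer] > self.limit:
                self.disconnected.add(peer)
        return peer in self.disconnected
```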

Using cryptographically signed tokens can provide 
stronger guarantees than Antfarm currently does at the 
cost of additional overhead and complexity. In such a 
scheme, the coordinator can sign all minted tokens be- 
fore issuing them to peers, enabling peers to verify that 
they are exchanging legitimate tokens with each other 
during each transaction. In addition, if the spender of 
a token were required to sign the token before send-
ing it, peers could prove the identities of token double-
spenders. Token signatures would prevent malicious
peers from snooping packets and tampering with tokens 
without the recipient’s knowledge. Antfarm does not im- 
plement a cryptographic scheme because the added over- 
head is not accompanied by a clear increase in perfor- 
mance. 

It is possible for Antfarm peers to collude in order to 
coerce the coordinator into providing their swarm with 
more bandwidth. In particular, peers could band together 
and send each other large numbers of tokens without 
sending each other blocks in exchange. The resulting in- 
flated estimate of that swarm’s aggregate bandwidth can 
lead the system to deviate from a desired allocation. Sev- 
eral techniques mitigate such attacks. First, the coordi- 
nator never issues more tokens than strictly necessary to 
download the file, thereby bounding the impact of fake 
transactions by the number of Sybils. Second, forcing 
participants to register with a form of hard identity, such 
as credit card numbers, can mitigate Sybils [12]. Finally, 
the coordinator can mandate that peers trade with a di- 
verse set of peers, reducing the effect of collusion among 
a small fraction of the swarm. Although the token proto- 
col does not eliminate the possibility of malicious behav- 
ior, its simplicity and ability to detect malicious activity 
limit the harm peers can inflict. 


3.6 Summary 


The Antfarm protocol strikes a balance between micro- 
managing peers and granting them freedom over block 
transfers. Tokens that must be returned to the coordi- 
nator enable the coordinator to collect accurate statistics 
on swarm dynamics and peer behavior. Systems such as 
BitTorrent, which grant peers full autonomy, do so at the 
expense of control and efficiency. At the other extreme, 
a centralized solution that precomputes the entire down-
load schedule for all participants would limit peers’ abil-
ity to quickly determine which peers have blocks they
require and retrieve them without intervention. Antfarm 
provides a hybrid approach that leaves peers free to de- 
termine their own local behavior while extracting suffi- 
cient information from the network to compute the glob- 
ally optimal allocation of available bandwidth among 
swarms. 


4 EVALUATION 


We have implemented the full protocol described in this 
paper, as well as a simulator of the Antfarm and BitTor- 
rent protocols. The Antfarm deployment runs on Win- 
dows, Linux, and Mac OS X. Both the implementation 
and the simulator contain optimizations present in ver- 
sion 5.0.9 of BitTorrent, including optimistic unchoke, 
regular unchoke, and local-rarest-block-first. For the ex- 
periments in this section, Antfarm’s system parameters 
(block size=64KB, optimistic unchoke interval=30s, reg- 
ular unchoke interval=10s) are identical to those found in 
this version of BitTorrent. We pick upload and download 
bandwidths representative of cable-connected end nodes. 

This section evaluates the performance of the Ant- 
farm protocol in comparison to BitTorrent and tradi- 
tional client-server approaches. Through simulations, we 
illustrate scenarios under which BitTorrent misuses its 
seeder capacity and show how Antfarm can achieve qual- 
itatively higher performance by allocating seeder band- 
width to swarms that provide the highest return. A Plan- 
etLab deployment confirms Antfarm’s allocation strategy 
under realistic network conditions. Lastly, this section 
shows that Antfarm’s coordinator can scale to support 
large deployments using modest resources. 


4.1 Simulations 


The differences between Antfarm and BitTorrent in a 
multi-swarm setting stem from the way the two protocols 
allocate their bandwidth to competing swarms. Whereas 
BitTorrent seeders allocate their bandwidth greedily to 
peers that absorb the most bandwidth, Antfarm allocates 
the precious seeder bandwidth preferentially to swarms 
whose response curves demonstrate the most benefit. As 
a result, there is a qualitative and significant difference 
between the two protocols; under some scenarios, Bit- 
Torrent can starve swarms and perform much worse than 
Antfarm, while in others with ample bandwidth, seeder 
allocation may have little impact on client download 
times. Figure 4 shows Antfarm’s performance in com-
parison to BitTorrent and a traditional client-server sys-
tem similar to YouTube for two swarm distribution sce-
narios. In the bimodal scenario, there is a single swarm
of 30 peers and 30 swarms of one peer each. The Zipf
scenario comprises swarms of sizes 50, 25, 16, 12, 10, 8,
and 5, and 400 singleton participants. Each set of three
bars shows the average aggregate bandwidth for a corre-
sponding scenario and seeder bandwidth.

Figure 4: Aggregate bandwidth for a client-server system,
BitTorrent, and Antfarm. When seeder bandwidth is plen-
tiful, even a client-server model can deliver high throughput.
When seeder bandwidth is limited, Antfarm outperforms Bit-
Torrent by allocating bandwidth to swarms that receive the
most benefit. Error bars indicate 95% confidence intervals.

Overall, Antfarm achieves the highest aggregate 
download bandwidth. In scenarios where there is ample 
seeder bandwidth, the differences between the three sys- 
tems are negligible and even a client-server approach is 
competitive with BitTorrent and Antfarm. As available 
seeder bandwidth per peer drops, however, swarming 
drastically outperforms the client-server approach, high- 
lighting the efficiency of peer-to-peer over a client-server 
system using comparable resources. For the scaled-down 
but realistic Zipf scenario, Antfarm achieves a factor of 
5 higher aggregate download bandwidth than BitTorrent. 
BitTorrent misallocates bandwidth by preferentially un- 
choking hosts based on their recent behavior, regardless 
of their potential to share blocks. In contrast, Antfarm 
steers the seeder’s capacity to swarms where blocks can 
be further shared among peers. 

Antfarm’s dynamic bandwidth allocation adapts well 
to changes in swarm dynamics. A well-known phe- 
nomenon is that when swarms become large, they are 
often able to saturate their peers’ uplinks, and some- 
times even their downlinks, without the aid of seeder 
bandwidth. Such self-sufficient swarms yield flat re- 
sponse curves. Antfarm’s allocation strategy naturally 
avoids dedicating bandwidth to self-sufficient swarms 
when there are other swarms that can benefit more. In 
contrast, BitTorrent does not take swarm dynamics into 
account, and can end up dedicating seeder bandwidth to
the exclusion of available peer bandwidth, leading to a 
shortage of seeder bandwidth for other, needier swarms.

Figure 5: Bandwidth of a singleton swarm and a large, self-
sufficient swarm. Even though a self-sufficient swarm can sat-
urate its peers’ bandwidth without seeder bandwidth, BitTor-
rent awards bandwidth to peers in the swarm. In contrast, Ant-
farm awards seeder bandwidth to the singleton swarm because
it receives high marginal benefit.


Figure 5 shows an exaggerated scenario that illustrates 
this effect. The figure shows the average download
bandwidths of peers in the two swarms under BitTorrent
and Antfarm. In this scenario, the seeder has a capacity of
100 KB/s, and each peer downloads a 10 MB file with 
30 KB/s download capacity and 10 KB/s upload capac- 
ity. The self-sufficient swarm saturates peers’ uplinks 
without seeder bandwidth and has a fresh peer arrive 
every second, resulting in a swarm of approximately 
1000 peers. The Antfarm coordinator determines that the 
self-sufficient swarm does not benefit from seeder band- 
width, and awards bandwidth to the singleton swarm in- 
stead. Under Antfarm, the singleton peer is able to com- 
plete its download in an average of 6 minutes. BitTorrent 
fails to provide the singleton swarm any bandwidth over 
the course of the 20-minute simulation.

The problems with BitTorrent’s allocation strategy are 
compounded in larger, more realistic scenarios. While 
large swarms are often self-sufficient, smaller non- 
singleton swarms can receive large multiplicative ben- 
efits from the seeder because their peers have available 
upload capacity to forward blocks. In contrast to the 
previous experiment, which examined the impact on a 
swarm at the tail end of the popularity distribution, Fig- 
ure 6 illustrates the impact of seeder bandwidth alloca- 
tion on a file of medium popularity. The figure shows the 
total amount of seeder bandwidth that Antfarm and Bit- 
Torrent allocate to a set of self-sufficient swarms, a new 
swarm of 5 peers, and 32 singleton swarms. It also shows 
the resulting average download bandwidths of peers in 
each swarm. The peers have 30 KB/s download capaci- 
ties and 20 KB/s upload capacities, and the self-sufficient 
swarms have peer interarrival times of 3, 6, 12, 24, 50, 
and 100 seconds. In the left-hand graph, BitTorrent
dedicates almost all of its bandwidth to the self-sufficient
swarms, whose peers are already saturated, and some
randomly to singleton swarms, which are unable to
forward blocks. The right-hand graph shows that Antfarm
awards enough bandwidth to the new swarm to saturate
its peers’ uplinks and divides the rest of its bandwidth
evenly among several singleton swarms because they
receive high marginal benefit. BitTorrent’s optimistic
unchoking protocol causes it to dedicate its bandwidth to
only a few singleton swarms over the 20-minute
simulation. Overall, Antfarm achieves an order of magnitude
increase in average download speed for the affected
swarms without a corresponding penalty for the popular
swarms.

Figure 6: BitTorrent versus Antfarm serving the middle of the popularity distribution. The shaded region indicates a new
swarm of 5 peers. Swarms to its left are self-sufficient; swarms to its right are singletons. BitTorrent (left) starves the new swarm,
preferring to dedicate bandwidth to the many peers in self-sufficient swarms. Antfarm (right) allocates enough seeder bandwidth
to the new swarm to saturate its peers’ upload bandwidths, and allocates the rest to singleton swarms because they receive high
marginal benefit.

Figure 7: Time versus bandwidth for Antfarm. The figures show seeder and aggregate bandwidths of the bimodal experiment
with seeder bandwidths of 800 KB/s (left) and 80 KB/s (right). Antfarm follows drastically different bandwidth allocation strategies
(dashed and dotted lines) to achieve high throughput (solid lines).


Figure 7 shows Antfarm’s bandwidth allocation over 
time to provide insight into Antfarm’s strategy. The left- 
hand graph shows that when seeder bandwidth is plenti- 
ful, Antfarm spends the vast majority of its bandwidth on 
small swarms since they receive the most marginal
benefit. When seeder bandwidth is constrained, as shown in
the right-hand graph, Antfarm achieves high aggregate
bandwidth by preferentially seeding large swarms that 
can leverage their upload capacity to multiply the bene- 
fits from the seeder. As peers of the large swarm com- 
plete their downloads at 5000 seconds, the seeder shifts 
its bandwidth to the singleton swarms. The staircase be- 
havior is due to different swarms completing at different 
times. 

Overall, Antfarm qualitatively outperforms BitTorrent 
in a multi-torrent setting by allocating bandwidth based 
on dynamically measured response curves and preferen- 
tially serving those swarms that benefit most from seeder 
bandwidth.
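
The allocation idea summarized above can be illustrated with a small sketch. It is not the coordinator's actual algorithm: it assumes each swarm's response curve is available as a function from seeder bandwidth to aggregate swarm bandwidth, and it hands out bandwidth greedily in fixed increments. The names `allocate` and `capped_linear`, and the three example curves, are hypothetical.

```python
from typing import Callable, Dict

# A response curve maps seeder bandwidth (KB/s) given to a swarm
# to the aggregate bandwidth (KB/s) that the swarm achieves.
ResponseCurve = Callable[[float], float]


def allocate(curves: Dict[str, ResponseCurve], capacity: float,
             step: float = 1.0) -> Dict[str, float]:
    """Greedily give `step` KB/s at a time to the swarm with the
    largest marginal gain; stop when no swarm benefits."""
    alloc = {name: 0.0 for name in curves}
    remaining = capacity
    while remaining >= step:
        gains = {name: curve(alloc[name] + step) - curve(alloc[name])
                 for name, curve in curves.items()}
        best = max(gains, key=gains.get)
        if gains[best] <= 0:
            break  # every curve is flat: leftover bandwidth stays unassigned
        alloc[best] += step
        remaining -= step
    return alloc


def capped_linear(slope: float, cap: float, base: float = 0.0) -> ResponseCurve:
    """Assumed curve shape: linear benefit until the swarm saturates."""
    return lambda seeder_bw: min(base + slope * seeder_bw, cap)


# Illustrative curves only (not measured data).
curves = {
    "self_sufficient": capped_linear(slope=1.0, cap=10000.0, base=10000.0),
    "new_swarm": capped_linear(slope=5.0, cap=100.0),
    "singleton": capped_linear(slope=1.0, cap=30.0),
}
result = allocate(curves, capacity=100.0)
```

In this toy setting the self-sufficient swarm has a flat curve and receives nothing, the new swarm is served until its curve flattens, and the remainder goes to the singleton up to its download limit.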


4.2 PlanetLab Deployment 


We tested Antfarm’s performance through a Planet- 
Lab [5] deployment. To demonstrate Antfarm’s response 
curves in practice, Figure 8 shows a measured response 
curve of a swarm comprised of 25 PlanetLab nodes, each
with an upload capacity of 50 KB/s. The graph plots both
the response scatterplot and the response curve as
computed by the coordinator from the token exchange. The
results confirm the simulations.

Figure 8: A response curve for a swarm consisting of
25 PlanetLab nodes, each with an upload capacity of
50 KB/s. Each data point is based on the average swarm
aggregate bandwidth over 10 minutes. Real-world response curves
confirm simulations.

Figure 9 compares the aggregate bandwidth achieved 
by Antfarm, BitTorrent, and traditional client-server 
downloads across 300 PlanetLab nodes, each with an up- 
load capacity of 50 KB/s. Swarms have size 100, 50, 25, 
12, 6, 3, and 1. In practice, the stock BitTorrent imple- 
mentation uploads only a few hand-picked files concur- 
rently; to evaluate BitTorrent in the presence of many 
swarms, we measured two values by running multiple 
seeder instances, each with its own upload capacity. Bit- 
Torrent Equal indicates the aggregate system bandwidth 
when the BitTorrent seeder splits its upload bandwidth 
equally among all swarms, including singleton swarms. 
BitTorrent Proportional shows performance when the 
seeder allocates to each swarm an upload bandwidth pro- 
portional to the size of the swarm. 
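
The two BitTorrent baselines differ only in how the seeder's upload capacity is divided among swarms. A minimal sketch of the two splits follows; the function names are hypothetical, and the example assumes one swarm per listed size and a 100 KB/s seeder purely for illustration.

```python
from typing import List


def split_equal(capacity: float, swarm_sizes: List[int]) -> List[float]:
    """BitTorrent Equal: every swarm gets the same share."""
    share = capacity / len(swarm_sizes)
    return [share for _ in swarm_sizes]


def split_proportional(capacity: float, swarm_sizes: List[int]) -> List[float]:
    """BitTorrent Proportional: share scales with swarm size."""
    total_peers = sum(swarm_sizes)
    return [capacity * size / total_peers for size in swarm_sizes]


sizes = [100, 50, 25, 12, 6, 3, 1]  # assumed: one swarm of each size
equal = split_equal(100.0, sizes)
proportional = split_proportional(100.0, sizes)
```

Neither split looks at how much a swarm actually gains from extra seeder bandwidth, which is the information Antfarm's response curves supply.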

Antfarm outperforms BitTorrent by allocating its 
bandwidth to the swarms that receive the most benefit. 
Antfarm’s advantages over BitTorrent become more pro- 
nounced in systems with many swarms accompanied by 
relatively small seeder uplink capacities, a realistic sce- 
nario for a distribution center with a large number of files 
and a bandwidth bottleneck. In these experiments, Ant- 
farm outperforms traditional client-server by a factor of 
between 50 and 100, BitTorrent Equal by a factor of 8 
to 18, and BitTorrent Proportional by a factor of 1.2 to 3. 


4.3 Scalability 


In this section, we examine how the Antfarm coordinator 
scales. We examine the steady-state bandwidth cost of 
running a coordinator in a setting where peers download 
a file made up of 64 KB blocks with upload and
download capacities of 64 KB/s.

Figure 9: PlanetLab experiments showing aggregate
bandwidth in Antfarm versus BitTorrent and client-server.
300 PlanetLab nodes are distributed among swarms ranging in
size from 1 to 100. Antfarm achieves high average performance
by making efficient use of limited bandwidth.

Figure 10: Aggregate bandwidth of swarms managed by
varying sizes of coordinator clusters. Each coordinator
machine runs on a PlanetLab node with an artificial bandwidth
cap of 100 KB/s to limit scalability. The task of the token
coordinator is embarrassingly parallel; the system capacity scales
linearly with the size of the coordinator cluster.


Figure 10 shows the bandwidth consumption at the co- 
ordinator as a function of the number of peers based on 
experiments run on PlanetLab. In the experiment, the 
lead coordinator and each token coordinator ran on its 
own PlanetLab node, and peers were simulated across 
other PlanetLab nodes, engaging in the Antfarm proto- 
col without sending actual data. The results show that 
even for large numbers of peers, the bandwidth consump- 
tion at the coordinator is modest. A coordinator running 
on a single PlanetLab host suffices for deployments of 
80,000 peers or more. To demonstrate the scalability 
of the hierarchically distributed coordinator, we test a 
coordinator distributed across multiple PlanetLab hosts 
in a system with an aggregate bandwidth approaching 
5 GB/s. To maximize generated load, peers omit the data
exchange but engage in the token protocol with the coor- 
dinator. Further, we artificially limit the bandwidth avail- 
able to each physical coordinator node to 100 KB/s to 
gain insight into the performance of multiple coordinator 
nodes running with severe bandwidth constraints. The 
bottom curve shows the capacity of a single, artificially- 
bottlenecked coordinator node, which is able to handle 
the tokens and peer lists of approximately 9000 peers be- 
fore its performance reaches a plateau. Adding a second 
such coordinator node doubles the capacity of the sys- 
tem. Because the token coordinators engage in a mas- 
sively parallel task with little communication overhead, 
increasing the number of coordinators linearly increases 
the maximum supported number of peers. 


5 RELATED WORK 


There has been much past work on content distribution, 
which can be grouped roughly into work on content dis- 
tribution networks, token-based systems, and multicast 
and streaming systems. 

CDNs: Content distribution networks are scalable 
systems used to alleviate server load, reduce download 
times, and avoid network hotspots. Akamai [31], for ex- 
ample, is a widely deployed infrastructure-based CDN 
that many content providers rely on to distribute their 
content. Similarly, cooperative web caching [7,25,27,57, 
58] removes load from origin servers. ECHOS [34] pro- 
poses distributing servers using a peer-to-peer network of 
set-top boxes distributed at the Internet’s periphery, man- 
aged by a single entity that can optimize system perfor- 
mance, but does not address bandwidth management at 
the servers. Although distributed CDNs scale, the band- 
width cost of operating them resides entirely with the 
content provider and distributor. 

Peer-to-peer CDNs effectively shift bandwidth costs 
from the content provider to clients. BitTorrent [8] is one 
of the most popular client-based peer-to-peer CDNs, and 
studies consistently show that BitTorrent traffic consti- 
tutes a significant fraction of Internet traffic [43,53]. Pi- 
atek et al. [46] augment the BitTorrent protocol to enable 
peers to share reputation information through one level 
of intermediary nodes; their approach does not address the issue of
multiple swarms. CoBlitz [44] is an HTTP-based content 
distribution network that splits a file into chunks, which 
are cached at distributed nodes. Choffnes et al. [15] re- 
duce cross-ISP traffic in peer-to-peer systems by harvest- 
ing data from existing CDNs for locality information. 
Shark [3] and ChunkCast [9] reduce client-perceived 
download latency via a structured overlay, and Coral [23] 
and Bamboo [50] assist clients in finding nearby copies 
of data. Antfarm similarly shifts cost to clients; however,
it retains control of network behavior by carefully
allocating bandwidth to each swarm.

Further, many systems such as the Data Oriented 
Transfer (DOT) architecture [42,54] use peer-to-peer 
swarming to speed up downloads. 

Token-based Incentives: An early model and analysis
by Qiu and Srikant [49] of BitTorrent’s incentive
mechanism showed that the system converges to a Nash
equilibrium where all peers upload at their capacity.
However, more recent work, including BitTyrant [45], BitThief [35],
and Sirivianos et al. [51], has demonstrated that average
download times currently depend on significant altruism 
from high capacity peers that, when withheld, reduces 
performance for all users. Further, BitTorrent’s tit-for- 
tat mechanism only operates within an individual swarm; 
it does not provide information on how to allocate re- 
sources, such as seeder bandwidth, among swarms. 

Dandelion [52] and BAR gossip [36] avoid the prob- 
lem of relying on altruism to distribute data. They 
use a cryptographic fair exchange mechanism that re- 
quires a client to upload content to other clients in ex- 
change for virtual credit, which can be redeemed for fu- 
ture service. Microcurrencies [10, 37, 47,59] similarly 
rely on cryptographically protected tokens for fair re- 
source exchange, and optionally provide additional fea- 
tures such as spender anonymity. Antfarm’s token sys- 
tem is domain-specific and significantly lighter-weight 
than these approaches. 

Decentralized resource allocation in peer-to-peer sys- 
tems requires incentives for participants to contribute re- 
sources. Ngan et al. [39] suggest cooperative audits to 
ensure that participants contribute storage commensurate 
with their usage. Samsara [16] considers storage allo- 
cation in a peer-to-peer storage system and introduces 
cryptographically signed storage claims to ensure that 
any user of remote storage devotes a like amount of stor- 
age locally. Both techniques center around audits of re- 
sources that are spatial in nature. 

Karma [56] and SHARP [22] resource allocation can 
apply to renewable resources such as bandwidth. Karma 
employs a global credit bank, with which clients main- 
tain accounts. The value of a client’s account increases 
when it contributes and decreases when it consumes. A 
client can only consume resources if its account con- 
tains sufficient credit. SHARP operates at the granular- 
ity of autonomous systems or sites. To join the system a 
SHARP site must negotiate resource contracts with one 
or more existing group members. These contracts, in ef- 
fect, specify the system’s expectations of the site and the 
site’s promise of available resources to the system. Ac- 
countable claims make it possible to monitor each partic- 
ipant’s compliance with its contracts, simplifying audits 
and making collusion more difficult in SHARP relative 
to other decentralized peer-to-peer systems.

Streaming and Multicast: Multicast and streaming 
are alternative designs for distributing content. For in- 
stance, the seminal work by Deering proposed IP mul- 
ticast to efficiently deliver content to multiple destina- 
tions [20]. Deployment difficulties with global IP multi- 
cast [18] led to application-level multicast systems such 
as End System Multicast [14], Your Own Internet Distri- 
bution (YOID) [21], and others [60]. 

Several techniques have been proposed to dis- 
tribute data efficiently using application-level multi- 
cast. Overcast [26] distributes content by construct- 
ing a bandwidth-optimized overlay tree among dedicated 
infrastructure nodes. SplitStream [13] distributes con- 
tent via a peer-to-peer overlay that disseminates content 
along branches of trees constructed on top of a peer-to- 
peer substrate. Bullet [29] and Bullet’ [28] also use a ran- 
domized overlay mesh to distribute data. Chainsaw [48] 
is a peer-to-peer multicast based on an unstructured over- 
lay mesh in which peers explicitly request packets from 
neighbors. This mechanism ensures that peers are able to 
receive all packets and avoid receiving duplicate packets. 
ChunkySpread [55] is a hybrid that uses both structured 
and unstructured overlays to distribute content. Antfarm 
differs from streaming multicast systems in that it aims to 
maximize aggregate system bandwidth for multiple con- 
current batch downloads. 

Another set of work proposes augmenting BitTorrent- 
like protocols to accommodate streaming video in a peer- 
to-peer setting. BASS [17] exemplifies this approach by 
adding peer-to-peer interactions to a client-server model 
where peers stream video from the server while trading 
blocks with other peers to alleviate load on the server 
in the future. Antfarm also incorporates a peer-to-peer 
protocol to alleviate load, but manages the interactions 
via the coordinator to achieve high throughput for mul- 
tiple swarms. Siddhartha et al. [4] propose a BitTorrent- 
like protocol with small neighborhoods of topographi- 
cally close peers for exchanging blocks, using heuristics 
to handle swarms of heterogeneous link capacities. 

Finally, many streaming and multicast architectures 
use coding to increase content delivery reliability [2, 
6, 24,38]. Integrating coding techniques into Antfarm 
could further improve performance. 


6 CONCLUSIONS 


In this paper we introduced Antfarm, a peer-to-peer con- 
tent distribution system for the batch dissemination of 
large files. Antfarm explores a novel space in the
design of swarming protocols; whereas past systems avoided
all vestiges of centralization for both technical and legal
reasons and suffered from a lack of coordination across
swarms, Antfarm examines how modest planning by
a centralized coordinator can help a set of competing 
swarms achieve high performance. 

The key to Antfarm’s performance is its restatement 
of the download management task as an optimization 
problem. The hill-climbing algorithm we propose effec- 
tively leverages available bandwidth, accommodates de- 
sired minimum bandwidth limits, avoids starvation, and 
enforces desired swarm priorities. The wire-level pro- 
tocol enables performance information to be extracted 
from the network, enabling a practical deployment that 
reacts to changing network and swarm conditions. Even 
though the approach embodies a logically centralized co- 
ordinator, the computational requirements of the coordi- 
nator are modest, the bandwidth requirement is feasibly 
small, and the coordinator carries out an embarrassingly 
parallel task that is easy to replicate across datacenters. 
PlanetLab deployments and simulations indicate that the 
system is practical, scalable, and capable of achieving 
significantly higher bandwidth than previous approaches. 
Abstract


We present HashCache, a configurable cache storage 
engine designed to meet the needs of cache storage 
in the developing world. With the advent of cheap 
commodity laptops geared for mass deployments, de- 
veloping regions are poised to become major users of 
the Internet, and given the high cost of bandwidth in 
these parts of the world, they stand to gain signifi- 
cantly from network caching. However, current Web 
proxies are incapable of providing large storage capac- 
ities while using small resource footprints, a requirement 
for the integrated multi-purpose servers needed to ef- 
fectively support developing-world deployments. Hash- 
Cache presents a radical departure from the conventional 
wisdom in network cache design, and uses 6 to 20 times 
less memory than current techniques while still provid- 
ing comparable or better performance. As such, Hash- 
Cache can be deployed in configurations not attainable 
with current approaches, such as having multiple ter- 
abytes of external storage cache attached to low-powered 
machines. HashCache has been successfully deployed in 
two locations in Africa, and further deployments are in 
progress. 


1 Introduction 


Network caching has been used in a variety of contexts 
to reduce network latency and bandwidth consumption, 
including FTP caching [31], Web caching [15, 4],
redundant traffic elimination [20, 28, 29], and content
distribution [1, 10, 26, 41]. All of these cases use local
storage, typically disk-based, to reduce redundant data 
fetches over the network. Large enterprises and ISPs 
particularly benefit from network caches, since they can 
amortize their cost and management over larger user pop- 
ulations. Cache storage system design has been shaped 
by this class of users, leading to design decisions that
favor first-world usage scenarios. For example, RAM
consumption is proportional to disk size due to in-memory
indexing of on-disk data, which was developed when
disk storage was relatively more expensive than it is now. 
However, because disk size has been growing faster than 
RAM sizes, it is now much cheaper to buy terabytes of 
disk than a machine capable of indexing that much stor- 
age, since most low-end servers have lower memory lim- 
its. 

This disk/RAM linkage makes existing cache storage 
systems problematic for developing world use, where it 
may be very desirable to have terabytes of cheap stor- 
age (available for less than US $100/TB) attached to 
cheap, low-power machines. However, if indexing a ter- 
abyte of storage requires 10 GB of RAM (typical for 
current proxy caches), then these deployments will re- 
quire server-class machines, with their associated costs 
and infrastructure. Worse, this memory is dedicated for 
use by a single service, making it difficult to deploy con- 
solidated multi-purpose servers. When low-cost laptops 
from the One Laptop Per Child project [22] or the Class- 
mate from Intel [13] cost only US $200 each, spending 
thousands of dollars per server may exceed the cost of 
laptops for an entire school. 

This situation is especially unfortunate, since band- 
width in developing regions is often more expensive, 
both in relative and absolute currency, than it is in the 
US and Europe. Africa, for example, has poor terrestrial 
connectivity, and often uses satellite connectivity, back- 
hauled through Europe. One of our partners in Nigeria, 
for example, shares a 2 Mbps link, which costs $5000 per 
month. Even the recently-planned “Google Satellite,” the 
O3b, is expected to drop the cost to only $500/Mbps per 
month by 2010 [21]. With efficient cache storage, one 
can reduce the network connectivity expenses. 

The goal of this project is to develop network cache 
stores designed for developing-world usage. In this pa- 
per, we present HashCache, a configurable storage sys- 
tem that implements flexible indexing policies, all of 
which are dramatically more efficient than traditional 
cache designs. The most radical policy uses no main 
memory for indexing, and obtains performance
comparable to traditional software solutions such as the Squid
Web proxy cache. The highest performance policy per- 
forms equally with commercial cache appliances while 
using main-memory indexes that are only one-tenth their 
size. Between these policies is a range of distinct
policies that trade memory consumption for performance,
suitable for a range of workloads in developing regions.


1.1 Rationale For a New Cache Store 


HashCache is designed to serve the needs of developing- 
world environments, starting with classrooms but work- 
ing toward backbone networks. In addition to good per- 
formance with low resource consumption, HashCache 
provides a number of additional benefits suitable for 
developing-world usage: (a) many HashCache policies 
can be tailored to use main memory in proportion to sys- 
tem activity, instead of cache size; (b) unlike commer- 
cial caching appliances, HashCache does not need to be 
the sole application running on the machine; (c) by sim- 
ply choosing the appropriate indexing scheme, the same 
cache software can be configured as a low-resource end- 
user cache appropriate for small classrooms, as well as 
a high-performance backbone cache for higher levels of 
the network; (d) in its lowest-memory configurations, 
HashCache can run on laptop-class hardware attached to 
external multi-terabyte storage (via USB, for example), a 
scenario not even possible with existing designs; and (e) 
HashCache provides a flexible caching layer, allowing it 
to be used not only for Web proxies, but also for other 
cache-oriented storage systems. 

A previous analysis of Web traffic in developing re- 
gions shows great potential for improving Web perfor- 
mance [8]. According to the study, kiosks in Ghana and 
Cambodia, with 10 to 15 users per day, have downloaded 
over 100 GB of data within a few months, involving 12 
to 14 million URLs. The authors argue for the need 
for applications that can perform HTTP caching, chunk 
caching for large downloads and other forms of caching 
techniques to improve the Web performance. With the 
introduction of personal laptops into these areas, it is rea- 
sonable to expect even higher network traffic volumes. 

Since HashCache can be shared by many applications 
and is not HTTP-specific, it avoids the problem of
diminishing returns seen with large HTTP-only caches.
HashCache can be used by both a Web proxy and a WAN
accelerator, which stores pieces of network traffic to
provide protocol-independent network compression. This
combination allows the Web cache to store static Web 
content, and then use the WAN accelerator to reduce 
redundancy in dynamically-generated content, such as 
news sites, Wikipedia, or even locally-generated content, 
all of which may be marked uncacheable, but which tend 
to only change slowly over time. While modern Web 
pages may be large, they tend to be composed of many
small objects, such as dozens of small embedded images. 
These objects, along with tiny fragments of cached net- 
work traffic from a WAN accelerator, put pressure on tra- 
ditional caching approaches using in-memory indexing. 

A Web proxy running on a terabyte-sized HashCache 
can provide a large HTTP store, allowing us to not only 
cache a wide range of traffic, but also speculatively pre- 
load content during off-peak hours. Furthermore, this 
kind of system can be driven from a typical OLPC-class 
laptop, with only 256MB of total RAM. One such lap- 
top can act as a cache server for the rest of the laptops in 
the deployment, eliminating the need for separate server- 
class hardware. In comparison, using current Web prox- 
ies, these laptops could only index 30GB of disk space. 

The rest of this paper is structured as follows. Sec- 
tion 2 explains the current state of the art in network 
storage design. Section 3 explains the problem, explores 
a range of HashCache policies, and analyzes them. Sec- 
tion 4 describes our implementation of policies and the 
HashCache Web proxy. Section 5 presents the perfor- 
mance evaluation of the HashCache Web Proxy and com- 
pares it with Squid and a modern high-performance sys- 
tem with optimized indexing mechanisms. Section 6 de- 
scribes the related work, Section 7 describes our current 
deployments, and Section 8 concludes with our future 
work. 


2 Current State-of-the-Art 


While typical Web proxies implement a number of fea- 
tures, such as HTTP protocol handling, connection man- 
agement, DNS and in-memory object caching, their per- 
formance is generally dominated by their filesystem or- 
ganization. As such, we focus on the filesystem com- 
ponent because it determines the overall performance 
of a proxy in terms of the peak request rate and object 
cacheability. With regard to filesystems, the two main 
optimizations employed by proxy servers are hashing 
and indexing objects by their URLs, and using raw disk 
to bypass filesystem inefficiencies. We discuss both of 
these aspects below. 

The Harvest cache [4] introduced the design of stor- 
ing objects by a hash of their URLs, and keeping an in- 
memory index of objects stored on disk. Typically, two 
levels of subdirectories were created, with the fan-out of 
each level configurable. The high-order bits of the hash 
were used to select the appropriate directories, and the 
file was ultimately named by the hash value. This ap- 
proach not only provided a simple file organization, but 
it also allowed most queries for the presence of objects to 
be served from memory, instead of requiring disk access. 
The older CERN [15] proxy, by contrast, stored objects 
by creating directories that matched the components of 
the URL. By hashing the URL, Harvest was able to
control both the depth and fan-out of the directories used
to store objects. The CERN proxy, Harvest, and its
descendant, Squid, all used the filesystems provided by the
operating system, simplifying the proxy and eliminating
the need for controlling the on-disk layout.

Table 1: System Entities for Web Caches
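
As an illustration of the Harvest-style layout described above, the following sketch maps a URL to a two-level directory path chosen by the high-order bits of its hash, with the file named by the full hash value. The choice of MD5, the default fan-outs, and the function name `cache_path` are assumptions made for the example, not details taken from Harvest or Squid.

```python
import hashlib


def cache_path(url: str, l1_fanout: int = 16, l2_fanout: int = 256) -> str:
    """Return 'L1/L2/<hash>' for a URL (hypothetical layout)."""
    digest = hashlib.md5(url.encode("utf-8")).digest()  # 128-bit hash (assumed)
    value = int.from_bytes(digest, "big")
    # Highest 16 bits pick the first-level directory,
    # the next 16 bits pick the second-level directory.
    level1 = (value >> 112) % l1_fanout
    level2 = ((value >> 96) & 0xFFFF) % l2_fanout
    return f"{level1:02X}/{level2:02X}/{digest.hex()}"


example = cache_path("http://example.com/index.html")
```

Because both fan-outs are parameters, the directory depth stays fixed at two while the number of files per directory can be tuned, which a layout that mirrors URL components cannot guarantee.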

The next step in the evolution of proxy design was us- 
ing raw disk and custom filesystems to eliminate multiple 
levels of directory traversals and disk head seeks associ- 
ated with them. The in-memory index now stored the 
location on disk where the object was stored, eliminating 
the need for multiple seeks to find the start of the object.¹

The first block of the on-disk file typically includes 
extra metadata that is too big to be held in memory, such 
as the complete URL, full response headers, and location 
of subsequent parts of the object, if any, and is followed 
by the content fetched from the origin server. In order to 
fully utilize the disk writing throughput, those blocks are 
often maintained consecutively, using a technique similar
to a log-structured filesystem (LFS) [30]. Unlike LFS,
which is expected to retain files until deleted by the user, 
cache filesystems can often perform disk cache replace- 
ment in LIFO order, even if other approaches are used 
for main memory cache replacement. Table 1 summa- 
rizes the object lookup and storage management of vari- 
ous proxy implementations that have been used to build 
Web caches. 

The upper bound on the number of cacheable objects 
per proxy is a function of available disk cache and phys- 
ical memory size. Attempting to use more memory than 
the machine’s physical memory can be catastrophic for 
caches, since unpredictable page faults in the applica- 
tion can degrade performance to the point of unusabil- 
ity. When these applications run as a service at network 
access points, which is typically the case, all users then 
suffer extra latency when page faults occur. 

The components of the in-memory index vary from 
system to system, but a representative configuration for 
a high-performance proxy is given in Table 2. Each 
entry has some object-specific information, such as its 
hash value and object size. It also has some disk-related
information, such as the location on disk, which disk,
and which generation of log, to avoid problems with log
wrapping. The entries typically are stored in a chain per
hash bin, and a doubly-linked LRU list across all index
entries. Finally, to shorten hash bin traversals (and the
associated TLB pressure), the number of hash bins is
typically set to roughly the number of entries.

¹This information was previously available on the iMimic
Networking Web site and the Volera Cache Web site, but both have disappeared.
No citable references appear to exist.


Using these fields and their sizes, the total consump- 
tion per index entry can be as low as 32 bytes per object, 
but given that the average Web object is roughly 8KB 
(where a page may have tens of objects), even 32 bytes 
per object represents an in-memory index storage that is 
1/256 the size of the on-disk storage. With a more re- 
alistic index structure, which can include a larger hash 
value, expiration time, and other fields, the index entry 
can be well over 80 bytes (as in the case of Squid), caus- 
ing the in-memory index to exceed 1% of the on-disk 
storage size. With a single 1TB drive, the in-memory index 
alone would be over 10GB. Increasing performance 
by using multiple disks would then require tens of giga- 
bytes of RAM. 
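
The arithmetic above can be checked with a short sketch. The function name and the binary interpretation of KB, GB, and TB are our assumptions for illustration; the entry sizes and the 8KB average object size come from the text.

```python
def index_ram_bytes(disk_bytes, entry_bytes, avg_object_bytes=8 * 1024):
    """RAM needed to index a disk that is full of average-sized objects."""
    num_objects = disk_bytes // avg_object_bytes
    return num_objects * entry_bytes


GB = 1024 ** 3
TB = 1024 ** 4

# 32-byte entries: the index is 1/256 of the disk size.
print(index_ram_bytes(TB, 32) / GB)
# 80-byte entries: roughly 1% of the disk size.
print(index_ram_bytes(TB, 80) / GB)
```

With exactly 80 bytes per entry the index for a 1TB disk comes to 10GB under these assumptions; larger entries push it beyond that.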


Reducing the RAM needed for indexing is desirable 
for several scenarios. Since the growth in disk capaci- 
ties has been exceeding the growth of RAM capacity for 
some time, this trend will lead to systems where the disk 
cannot be fully indexed due to a lack of RAM. Dedicated 
RAM also effectively limits the degree of multiprogram- 
ming of the system, so as processors get faster relative 
to network speeds, one may wish to consolidate multi- 
ple functions on a single server. WAN accelerators, for 
example, cache network data [5, 29, 34], so having very 
large storage can reduce bandwidth consumption more 
than HTTP proxies alone. Similarly, even in HTTP prox- 
ies, RAM may be more useful as a hot object cache than 
as an index, as is the case in reverse proxies (server ac- 
celerators) and content distribution networks. One goal 
in designing HashCache is to determine how much index 
memory is really necessary. 

Figure 1: HashCache-Basic: objects with hash value i go 
to the i-th bin for the first block of a file. Later blocks are 
in the circular log. 


3 Design 


In this section, we present the design of HashCache 
and show how performance can be scaled with avail- 
able memory. We begin by showing how to eliminate the 
in-memory index while still obtaining reasonable perfor- 
mance, and then we show how selective use of minimal 
indexing can improve performance. A summary of policies 
is shown in Table 3. 


3.1 Removing the In-Memory Index 

We start by removing the in-memory index entirely, and 
incrementally introducing minimal metadata to system- 
atically improve performance. To remove the in-memory 
index, we have to address the two functions the in- 
memory index serves: indicating the existence of an ob- 
ject and specifying its location on disk. Using filesys- 
tem directories to store objects by hash has its own per- 
formance problems, so we seek an alternative solution — 
treating the disk as a simple hashtable. 

HashCache-Basic, the simplest design option in the 
HashCache family, treats part of the disk as a fixed-size, 
non-chained hash table, with one object stored in each 
bin. This portion is called the Disk Table. It hashes the 
object name (a URL in the case of a Web cache) and then 
calculates the hash value modulo the number of bins to 
determine the location of the corresponding file on disk. 
To avoid false positives from hash collisions, each stored 
object contains metadata, including the original object 
name, which is compared with the requested object name 
to confirm an actual match. New objects for a bin are 
simply written over any previous object. 
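
As a rough illustration of this lookup path, the sketch below models the Disk Table with a Python list instead of a file. The class name, the use of SHA-1, and the in-memory list are our assumptions; only the hash-modulo-bins placement, the name comparison, and the overwrite-on-insert behavior come from the description above.

```python
import hashlib


class BasicDiskTable:
    """Toy model of HashCache-Basic: a fixed-size, non-chained hash table.

    A list stands in for the on-disk bins; a real cache would instead
    seek to bin_index * bin_size inside a large file.
    """

    def __init__(self, num_bins):
        self.bins = [None] * num_bins

    def _bin_index(self, name):
        # Assumed hash function; any well-mixed hash would do.
        digest = hashlib.sha1(name.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % len(self.bins)

    def put(self, name, data):
        # A new object simply overwrites whatever the bin held before.
        self.bins[self._bin_index(name)] = (name, data)

    def get(self, name):
        # One "disk access" is needed whether this is a hit or a miss.
        entry = self.bins[self._bin_index(name)]
        if entry is None:
            return None
        stored_name, data = entry
        # Comparing the stored name rules out a hash collision.
        return data if stored_name == name else None
```

With a single bin, inserting a second object evicts the first, which is the collision behavior that the later policies address.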

Since objects may be larger than the fixed-size bins 
in the Disk Table, we introduce a circular log that con- 
tains the remaining portion of large objects. The object 
metadata stored in each Disk Table bin also includes the 
location in the log, the object size, and the log generation 
number, and is illustrated in Figure 1. 

The performance impact of these decisions is as 
follows: in comparison to high-performance caches, 
HashCache-Basic will have an increase in hash collisions 
(reducing cache hit rates), and will require a disk access 
on every request, even cache misses. Storing objects will 
require one seek per object (due to the hash randomizing 
the location), and possibly an additional write to the 
circular log. 

Figure 2: HashCache-Set: the first block of a file is found 
by searching through the set that its hash value maps to. 
Later blocks are in the circular log. Some arrows are 
shown crossed to illustrate that objects that map onto a 
set can be placed anywhere in the set. 


3.2 Collision Control Mechanism 


While in-memory indexes can use hash chaining to elim- 
inate the problem of hash values mapped to the same bin, 
doing so for an on-disk index would require many ran- 
dom disk seeks to walk a hash bin, so we devise a sim- 
pler and more efficient approach while retaining most of 
the benefits. 

In HashCache-Set, we expand the Disk Table to be- 
come an N-way set-associative hash table, where each 
bin can store N elements. Each element still contains 
metadata with the full object name, size, and location in 
the circular log of any remaining part of the object. Since 
these locations are contiguous on disk, and since short 
reads have much lower latency than seeks, reading all of 
the members of the set takes only marginally more time 
than reading just one element. This approach is shown in 
Figure 2, and reduces the impact of popular objects map- 
ping to the same hash bin, while only slightly increasing 
the time to access an object. 

While HashCache-Set eliminates problems stemming 
from collisions in the hash bins, it still has several prob- 
lems: it requires disk access for cache misses, and lacks 
an efficient mechanism for cache replacement within the 
set. Implementing something like LRU within the set us- 
ing the on-disk mechanism would require a potential disk 
write on every cache hit, reducing performance. 
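
A minimal sketch of the set-associative variant follows, again with Python lists standing in for contiguous disk regions. The class name, the SHA-1 hash, and the first-in-first-out eviction are our assumptions; the text deliberately leaves replacement within the set unresolved at this point, so the eviction rule here is only a stand-in.

```python
import hashlib


class SetAssociativeDiskTable:
    """Toy model of HashCache-Set: each bin holds up to `ways` objects."""

    def __init__(self, num_sets, ways=8):
        self.ways = ways
        self.sets = [[] for _ in range(num_sets)]

    def _set_index(self, name):
        digest = hashlib.sha1(name.encode("utf-8")).digest()  # assumed hash
        return int.from_bytes(digest[:8], "big") % len(self.sets)

    def get(self, name):
        # One contiguous read fetches the whole set; the scan is in memory.
        for stored_name, data in self.sets[self._set_index(name)]:
            if stored_name == name:
                return data
        return None

    def put(self, name, data):
        members = self.sets[self._set_index(name)]
        for i, (stored_name, _) in enumerate(members):
            if stored_name == name:
                members[i] = (name, data)  # refresh an existing object
                return
        if len(members) >= self.ways:
            members.pop(0)  # stand-in policy: evict the oldest insertion
        members.append((name, data))
```

Two objects that map to the same set can now coexist, whereas HashCache-Basic would keep only the most recent one.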


3.3 Avoiding Seeks for Cache Misses 


Requiring a disk seek to determine a cache miss is a major 
issue for workloads with low cache hit rates, since an 
index-less cache would spend most of its disk time confirming 
cache misses. This behavior would add extra latency 
for the end-user, and provide no benefit. To address 
the problem of requiring seeks for cache misses, we introduce 
the first HashCache policy with any in-memory 
index, but employ several optimizations to keep the index 
much smaller than traditional approaches. 

with Squid and commercial entries included for comparison. 
Main memory consumption values assume an average object size of 8KB. Squid memory data appears in 

As a starting point, we consider storing in main memory 
an H-bit hash value for each cached object. These 
hash values can be stored in a two-dimensional array 
which corresponds to the Disk Table, with one row for 
each bin, and N columns corresponding to the N-way 
associativity. An LRU cache replacement policy would 
need forward and reverse pointers per object to maintain 
the LRU list, bringing the per-object RAM cost to (H + 
64) bits assuming 32-bit pointers. However, we can re- 
duce this storage as follows. 

First, we note that all the entries in an N-entry set share 
the same modulo hash value (%S) where S is the number 
of sets in the Disk Table. We can drop the lowest log(S) 
bits from each hash value with no loss, reducing the hash 
storage to only H - log(S) bits per object. 

Secondly, we note that cache replacement policies 
only need to be implemented within the N-entry set, so 
LRU can be implemented by simply ranking the entries 
from 0 to N-1, thereby using only log(N) bits per entry. 
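
Both reductions can be sketched in a few lines. The function names are hypothetical, and the sketch assumes that the number of sets S is a power of two, so that the modulo value is exactly the lowest log2(S) bits of the hash.

```python
def split_hash(h, num_sets):
    """Split a hash into (set index, remaining high bits).

    Assumes num_sets is a power of two, so h % num_sets equals the lowest
    log2(num_sets) bits and need not be stored with the entry.
    """
    shift = num_sets.bit_length() - 1
    return h % num_sets, h >> shift


def touch(ranks, way):
    """Mark `way` as most recently used.

    Ranks run from 0 (most recent) to N-1 (least recent), so each one
    fits in log2(N) bits.
    """
    old = ranks[way]
    for i, rank in enumerate(ranks):
        if rank < old:
            ranks[i] = rank + 1  # everything newer than `way` ages by one
    ranks[way] = 0


def victim(ranks):
    """Return the way holding the least recently used entry."""
    return ranks.index(len(ranks) - 1)
```

For example, with 16 sets the hash 0b10110110 splits into set index 6 and stored bits 0b1011, and the original hash can be rebuilt from the two parts.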

We can further choose to keep in-memory indexes for 
only some sets, not all sets, so we can restrict the number 
of in-memory entries based on request rate, rather than 
cache size. This approach keeps sets in an LRU fashion, 
and fetches the in-memory index for a set from disk on 
demand. By keeping only partial sets, we need to also 
keep a bin number with each set, LRU pointers per set, 
and a hash table to find a given set in memory. 

Deciding when to use a complete two-dimensional array 
versus partial sets with bin numbers and LRU pointers 
depends on the size of the hash value and the set 
associativity. Assuming 8-way associativity and the 8 most 
significant hash bits per object, the break-even point is 
around 50%: once more than half the sets are stored 
in memory, it is cheaper to remove the LRU pointers and 
bin number, and just keep all of the sets. A discussion of 
how to select values for these parameters is provided in 
Section 4. 

If the full array is kept in memory, we call it 
HashCache-SetMem, and if only a subset is kept in 
memory, we call it HashCache-SetMemLRU. With a 
low hash collision rate, HashCache-SetMem can deter- 
mine most cache misses without accessing disk, whereas 
HashCache-SetMemLRU, with its tunable memory con- 
sumption, will need disk accesses for some fraction of 
the misses. However, once a set is in memory, per- 
forming intra-set cache replacement decisions requires 
no disk access for policy maintenance. Writing objects 
to disk will still require disk access. 


3.4 Optimizing Cache Writes 


With the previous optimizations, cache hits require one 
seek for small files, and cache misses require no seeks 
(excluding false positives from hash collisions) if the as- 
sociated set’s metadata is in memory. Cache writes still 
require seeks, since object locations are dictated by their 
hash values, leaving HashCache at a performance dis- 
advantage to high-performance caches that can write all 
content to a circular log. This performance problem is 
not an issue for caches with low request rates, but will 
become a problem for higher request rate workloads. 

To address this problem, we introduce a new pol- 
icy, HashCache-Log, that eliminates the Disk Table and 
treats the disk as a log, similar to the high-performance 
caches. For some or all objects, we store an additional 
offset (32 or 64 bits) specifying the location on disk. We 
retain the N-way set associativity and per-set LRU re- 
placement because they eliminate disk seeks for cache 
misses with compact implementation. While this ap- 
proach significantly increases memory consumption, it 
can also yield a large performance advantage, so this 
tradeoff is useful in many situations. However, even 
when adding the log location, the in-memory index is 
still much smaller than traditional caches. For exam- 
ple, for 8-way set associativity, per-set LRU requires 3 
bits per entry, and 8 bits per entry can minimize hash 
collisions within the set. Adding a 32-bit log position 
increases the per-entry size from 11 bits to 43 bits, but 
virtually eliminates the impact of write traffic, since all 
writes can now be accumulated and written in one disk 
seek. Additionally, we need a few bits (assume 4) to 
record the log generation number, driving the total to 47 
bits. Even at 47 bits per entry, HashCache-Log still uses 
indexes that are a factor of 6-12 times smaller than cur- 
rent high-performance proxies. 

We can reduce this overhead even further if we ex- 
ploit Web object popularity, where half of the objects are 
rarely, if ever, re-referenced [8]. In this case, we can 
drop half of the log positions from the in-memory index, 
and just store them on disk, reducing the average per- 
entry size to only 31 bits, for a small loss in performance. 
HashCache-LogLRU allows the number of log position 
entries per set to be configured, typically using N/2 log 
positions per N-object set. The remaining log offsets in 
the set are stored on the disk as a small contiguous file. 
Keeping this file and the in-memory index in sync requires 
a few writes, reducing the performance by a small 
amount. The in-memory index size, in this case, is 9-20 
times smaller than traditional high-performance systems. 
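
The per-entry sizes quoted in this section can be reproduced with a small calculation. The function and parameter names are ours, and the sketch assumes that the LRU rank, hash bits, and generation number are kept for every entry while only the log offset is dropped for some entries; that split is our reading of how the 31-bit figure arises.

```python
def log_index_bits(ways=8, hash_bits=8, offset_bits=32, gen_bits=4,
                   offsets_in_memory=None):
    """Average in-memory bits per entry for a HashCache-Log style index."""
    if offsets_in_memory is None:
        offsets_in_memory = ways  # HashCache-Log: every entry keeps an offset
    lru_bits = (ways - 1).bit_length()  # rank within the set, 3 bits for 8 ways
    fixed = lru_bits + hash_bits + gen_bits
    return fixed + offset_bits * offsets_in_memory / ways


print(log_index_bits())                     # all 8 offsets in memory
print(log_index_bits(offsets_in_memory=4))  # half of the offsets in memory
```

Under these assumptions the two calls give 47 and 31 bits per entry, matching the figures in the text.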


3.5 Prefetching Cache Reads 


With all of the previous optimizations, caching storage 
can require as little as 1 seek per object read for small 
objects, with no penalty for cache misses, and virtually 
no cost for cache writes that are batched together and 
written to the end of the circular log. However, even 
this performance can be further improved, by noting that 
prefetching multiple objects per read can amortize the 
read cost per object. 

Correlated access can arise in situations like Web 
pages, where multiple small objects may be embedded 
in the HTML of a page, resulting in many objects being 
accessed together during a small time period. Grouping 
these objects together on disk would reduce disk seeks 
for reading and writing. The remaining blocks for these 
pages can all be coalesced together in the log and written 
together so that reading them can be faster, ideally with 
one seek. 

The only change necessary to support this policy is 
to keep a content length (in blocks) for all of the related 
content written at the same time, so that it can be 
read together in one seek. When multiple related objects 
are read together, the system will perform reads at less 
than one seek per read on average. This approach can 
be applied to many of the previously described HashCache 
policies, and only requires that the application using 
HashCache provide some information about which 
objects are related. Assuming prefetch lengths of no 
more than 256 blocks, this policy only requires 8 bits 
per index entry being read. In the case of HashCache-LogLRU, 
only the entries with in-memory log position 
information need the additional length information. Otherwise, 
this length can also be stored on disk. As a result, 
adding this prefetching to HashCache-LogLRU only increases 
the in-memory index size to 35 bits per object, 
assuming half the entries of each set contain a log position 
and prefetch length. 

Table 4: Throughputs for techniques, where rr = peak request 
rate, chr = cache hit rate, cbr = cacheability rate, rel = 
average number of related objects, and t = peak disk seek rate. 
All calculations include read prefetching, so the results 
for Log and Grouped are the same. To exclude the effects 
of read prefetching, simply set rel to one. 

For the rest of this paper, we assume all the policies to 
have this optimization except HashCache-LogN which is 
the HashCache-Log policy without any prefetching. 


3.6 Expected Throughput 


To understand the throughput implications of the vari- 
ous HashCache schemes, we analyze their expected per- 
formance under various conditions using the parameters 
shown in Table 4. 

The maximum request rate (rr) is a function of the 
disk seek rate, the hit rate, the miss rate, and the write 
rate. The write rate is required because not all objects 
that are fetched due to cache misses are cacheable. Table 4 
presents throughputs for each system as a function 
of these parameters. The cache hit rate (chr) is simply a 
number between 0 and 1, as is the cacheability rate (cbr). 
Since the miss rate is (1 - chr), the write rate can be 
represented as (1 - chr) · cbr. The peak disk seek rate (t) 
is a measured quantity that is hardware-dependent, and 
the average number of related objects (rel) is always a 
positive number. Due to space constraints, we omit the 
derivations for these calculations. These throughputs are 
conservative estimates because we do not take into account 
the in-memory hot object cache, where some portion 
of the main memory is used as a cache for frequently 
used objects, which can further improve throughput. 
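
To make the relationship between these rates concrete, the sketch below computes the fraction of requests that fall into each category. The function name and the example values are ours; it takes the write rate to be the miss rate multiplied by the cacheability rate, since only cacheable misses are written. It does not reproduce the throughput formulas of Table 4, whose derivations are omitted.

```python
def request_mix(chr_, cbr):
    """Fractions of requests that are hits, misses, and cache writes."""
    miss = 1.0 - chr_
    return {"hit": chr_, "miss": miss, "write": miss * cbr}


# Illustrative values only: 60% hit rate, half of the misses cacheable.
print(request_mix(0.6, 0.5))
```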


4 HashCache Implementation 


We implement a common HashCache filesystem I/O 
layer so that we can easily use the same interface with 
different applications. We expose this interface via 
POSIX-like calls, such as open(), read(), write(), close(), 
seek(), etc., to operate on files being cached. Rather than 
operate directly on raw disk, HashCache uses a large file 
in the standard Linux ext2/ext3 filesystem, which does 
not require root privilege. Creating this zero-filled large 
file on a fresh ext2/ext3 filesystem typically creates a 
mostly contiguous on-disk layout. HashCache creates large 
files on each physical disk and multiplexes them for 
performance. The HashCache filesystem is used by the 
HashCache Web proxy cache as well as other applications we 
are developing. 


4.1 External Indexing Interface 


HashCache provides a simple indexing interface to sup- 
port other applications. Given a key as input, the inter- 
face returns a data structure containing the file descrip- 
tors for the Disk Table file and the contiguous log file 
(if required), the location of the requested content, and 
metadata such as the length of the contiguous blocks be- 
longing to the item, etc. We implement the interface for 
each indexing policy we have described in the previous 
section. Using the data returned from the interface one 
can utilize the POSIX calls to handle data transfers to 
and from the disk. Calls to the interface can block if disk 
access is needed, but multiple calls can be in flight at the 
same time. The interface consists of roughly 600 lines of 
code, compared to 21000 lines for the HashCache Web 
Proxy. 


4.2 HashCache Proxy 


The HashCache Web Proxy is implemented as an 
event-driven main process with cooperating helper pro- 
cesses/threads handling all blocking operations, such as 
DNS lookups and disk I/Os, similar to the design of 
Flash [25]. When the main event loop receives a URL re- 
quest from a client, it searches the in-memory hot-object 
cache to see if the requested content is already in mem- 
ory. In case of a cache miss, it looks up the URL us- 
ing one of the HashCache indexing policies. Disk I/O 
helper processes use the HashCache filesystem I/O inter- 
face to read the object blocks into memory or to write 
the fetched object to disk. To minimize inter-process 
communication (IPC) between the main process and the 
helpers, only beacons are exchanged on IPC channels 
and the actual data transfer is done via shared memory. 


4.3 Flexible Memory Management 


HTTP workloads will often have a small set of objects 
that are very popular, which can be cached in main mem- 
ory to serve multiple requests, thus saving disk I/O. Gen- 
erally, the larger the in-memory cache, the better the 
proxy’s performance. HashCache proxies can be config- 
ured to use all the free memory on a system without un- 
duly harming other applications. To achieve this goal, we 
implement the hot object cache via anonymous mmap() 
calls so that the operating system can evict pages as 
memory pressure dictates. Before the HashCache proxy 
uses the hot object cache, it checks the memory residency 
of the page via the mincore() system call, and sim- 
ply treats any missing page as a miss in the hot object 
cache. The hot object cache is managed as an LRU list 
and unwanted objects or pages no longer in main mem- 
ory can be unmapped. This approach allows the Hash- 
Cache proxy to use the entire main memory when no 
other applications need it, and to seamlessly reduce its 
memory consumption when there is memory pressure in 
the system. 
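
The residency check can be illustrated with a small sketch for Linux-like systems. The class and method names are hypothetical, the cache is simplified to one object per page, and mincore() is reached through ctypes because Python's mmap module does not wrap it; this is not the proxy's actual code.

```python
import ctypes
import mmap
import os

PAGE = mmap.PAGESIZE

# Load the C library of the running process to reach mincore(2).
_libc = ctypes.CDLL(None, use_errno=True)
_libc.mincore.argtypes = [ctypes.c_void_p, ctypes.c_size_t,
                          ctypes.POINTER(ctypes.c_ubyte)]
_libc.mincore.restype = ctypes.c_int


class HotObjectCache:
    """Hypothetical page-granular hot object cache in an anonymous mapping."""

    def __init__(self, num_pages):
        self.buf = mmap.mmap(-1, num_pages * PAGE,
                             flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS)
        # Base address of the mapping; valid while self.buf stays open.
        self._base = ctypes.addressof(ctypes.c_char.from_buffer(self.buf))

    def is_resident(self, page_no):
        vec = (ctypes.c_ubyte * 1)()
        if _libc.mincore(self._base + page_no * PAGE, PAGE, vec) != 0:
            err = ctypes.get_errno()
            raise OSError(err, os.strerror(err))
        return bool(vec[0] & 1)  # low bit set means the page is in RAM

    def put(self, page_no, data):
        start = page_no * PAGE
        self.buf[start:start + len(data)] = data

    def get(self, page_no, length):
        # A page that is not in RAM is treated as a hot-cache miss.
        if not self.is_resident(page_no):
            return None
        start = page_no * PAGE
        return bytes(self.buf[start:start + length])
```

A caller would fall back to the on-disk cache whenever get() returns None, so a page that is no longer in memory costs no more than an ordinary hot-cache miss.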

In order to maximize the disk writing throughput, the 
HashCache proxy buffers recently-downloaded objects 
so that many objects can be written in one batch (often 
to a circular log). These dirty objects can be served from 
memory while waiting to be written to disk. This dirty 
object cache reduces redundant downloads during flash 
crowds because many popular HTTP objects are usually 
requested by multiple clients. 

HashCache also provides for grouping related objects 
to disk so that they can be read together later, providing 
the benefits of prefetching. The HashCache proxy uses 
this feature to amortize disk seeks over multiple objects, 
thereby obtaining higher read performance. One com- 
mercial system parses HTML to explicitly find embed- 
ded objects [7], but we use a simpler approach — simply 
grouping downloads by the same client that occur within 
a small time window and that have the same HTTP Referer 
field. We have found that this approach works well 
in practice, with much less implementation complexity. 
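
The grouping heuristic can be sketched as follows. The function name, the tuple layout, and the two-second window are assumptions made for illustration; the text specifies only that downloads are grouped by client, a small time window, and the referring page.

```python
def group_related(downloads, window=2.0):
    """Group downloads that share a client and referer within `window` seconds.

    `downloads` is an iterable of (timestamp, client, referer, url) tuples.
    Returns a list of URL lists, one per group, in order of first arrival.
    """
    groups = []
    open_groups = {}  # (client, referer) -> (start time, list of URLs)
    for ts, client, referer, url in sorted(downloads, key=lambda d: d[0]):
        key = (client, referer)
        current = open_groups.get(key)
        if current is not None and ts - current[0] <= window:
            current[1].append(url)
        else:
            # Start a new group when the window for this key has expired.
            current = (ts, [url])
            open_groups[key] = current
            groups.append(current[1])
    return groups
```

Each resulting group is a candidate for being written contiguously to the log, so that a later request for the page can fetch all of its objects in one read.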


4.4 Parameter Selection 


For the implementation, we choose some design param- 
eters such as the block size, the set size, and the hash 
size. Choosing the block size is a tradeoff between space 
usage and the number of seeks necessary to read small 
objects. In Table 5, we show an analysis of object sizes 
from a live, widely-used Web cache called CoDeeN [41]. 
We see that nearly 75% of objects are less than 8KB, 
while 87.2% are less than 16KB. Choosing an 8KB block 
would yield better disk usage, but would require multiple 
seeks for 25% of all objects. Choosing the larger block 
size wastes some space, but may increase performance. 
Since the choice of block size influences the set size, 
we make the decisions based on the performance of current 
disks. Table 6 shows the average number of seeks 
per second of three recent SATA disks (18, 60 and 150 
GB each). We notice the sharp degradation beyond 
64KB, so we use that as the set size. Since 64KB can 
hold 4 blocks of 16KB each or 8 blocks of 8KB each, we 
opt for an 8KB block size to achieve 8-way set associativity. 
With 8 objects per set, we choose to keep 8 bits 
of hash value per object for the in-memory indexes, to 
reduce the chance of collisions. This kind of analysis 
can be performed automatically during initial system 
configuration, and these are the only parameters needed once 
the specific HashCache policy is chosen. 

Table 5: CDF of Web object sizes (columns: Size (KB), % of objects < size) 

Read Size (KB) | Latency/seek (ms) 
1 | 78 


5 Performance Evaluation 


In this section, we present experimental results that com- 
pare the performance of different indexing mechanisms 
presented in Section 3. Furthermore, we present a 
comparison between the HashCache Web Proxy Cache, 
Squid, and a high-performance commercial proxy called 
Tiger, using various configurations. Tiger implements 
the best practices outlined in Section 2 and is currently 
used in commercial service [6]. We also present the im- 
pact of the optimizations that we included in the Hash- 
Cache Web Proxy Cache. For fair comparison, we use 
the same basic code base for all the HashCache variants, 
with differences only in the indexing mechanisms. 


5.1 Workload 


To evaluate these systems, we use the Web Poly- 
graph [37] benchmarking tool, the de facto industry stan- 
dard for testing the performance of HTTP intermediaries 
such as content filters and caching proxies. We use the 
Polymix [38] environment model, which models many 
key Web traffic characteristics, including: multiple con- 
tent types, diurnal load spikes, URLs with transient pop- 
ularity, a global URL set, flash crowd behavior, an un- 
limited number of objects, DNS names in URLs, object 
life-cycles (expiration and last-modification times), per- 
sistent connections, network packet loss, reply size vari- 
ations, object popularity (recurrence), request rates and 
inter-arrival times, embedded objects and browser behav- 
ior, and cache validation (If-Modified-Since requests and 
reloads). 

Table 6: Disk performance statistics 


We use the latest standard workload, Polymix-4 [38], 
which was used at the Fourth Cache-off event [39] to 
benchmark many proxies. The Polygraph test harness 
uses several machines for emulating HTTP clients and 
others to act as Web servers. This workload offers a 
cache hit ratio (CHR) of 60% and a byte hit ratio (BHR) 
of 40%, meaning that at most 60% of the objects are 
cache hits while 40% of bytes are cache hits. The aver- 
age download latency is 2.5 seconds (including RTT). A 
large number of objects are smaller than 8.5 KB. HTML 
pages contain 10 to 20 embedded (related) objects, with 
an average size of 5 to 10 KB. A small number (0.1 %) 
of large downloads (300 KB or more) have higher cache 
hit rates. These numbers are very similar to the charac- 
teristics of traffic in developing regions [8]. 
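
The difference between the two ratios can be made concrete with a short sketch. The function name and the five-request trace are invented for illustration; the trace is chosen so that the ratios match the 60% and 40% offered by the workload.

```python
def hit_ratios(requests):
    """Return (cache hit ratio, byte hit ratio).

    `requests` is an iterable of (size_in_bytes, was_hit) pairs.
    """
    requests = list(requests)
    total_bytes = sum(size for size, _ in requests)
    hits = sum(1 for _, hit in requests if hit)
    hit_bytes = sum(size for size, hit in requests if hit)
    return hits / len(requests), hit_bytes / total_bytes


# Made-up trace: hits are more frequent but smaller than misses.
trace = [(10000, True), (10000, True), (20000, True),
         (30000, False), (30000, False)]
print(hit_ratios(trace))
```

Because the objects that hit are smaller on average than those that miss, the byte hit ratio comes out lower than the object hit ratio.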

We test three environments, reflecting the kinds of 
caches we expect to deploy. These are the low-end sys- 
tems that reflect the proxy powered by a laptop or simi- 
lar system, large-disk systems where a larger school can 
purchase external storage to pre-load content, and high- 
performance systems for ISPs and network backbones. 


5.2 Low-End System Experiments 


Our first test server for the proxy is designed to mimic 
a low-memory laptop, such as the OLPC XO Laptop, or 
a shared low-powered machine like an OLPC XS server. 
Its configuration includes a 1.4 GHz CPU with 512 KB 
of L2 cache, 256 MB RAM, two 60GB 7200 RPM SATA 
drives, and the Fedora 8 Linux OS. This machine is far 
from the standard commercial Web cache appliance, and 
is likely to be a candidate machine for the developing 
world [23]. 

Our tests for this machine configuration run at 40-275 
requests per second, per disk, using either one or two 
disks. Figure 3 shows the results for single disk perfor- 
mance of the Web proxy using HashCache-Basic (HC- 
B), HashCache-Set (HC-S), HashCache-SetMem (HC- 
SM), HashCache-Log without object prefetching (HC- 
LN), HashCache-Log with object prefetching (HC-L), 
Tiger and Squid. The HashCache tests use 60 GB caches. 
However, Tiger and Squid were unable to index this 
amount of storage and still run acceptably, so were lim- 
ited to using 18 GB caches. This smaller cache is still 
sufficient to hold the working set of the test, so Tiger and 
Squid do not suffer in performance as a result. Table 7 
gives the analytical lower bounds for performance of each 
of these policies for this workload and the disk performance. 
The tests for HashCache-Basic and HashCache-Set 
achieve only 45 reqs/sec. The tests for HashCache-SetMem 
achieve 75 reqs/sec. Squid scales better than 
HashCache-Basic and HashCache-Set and achieves 60 
reqs/sec. HashCache-Log (with prefetch), in comparison, 
achieves 275 reqs/sec. The Tiger proxy, with its 
optimized indexing mechanism, achieves 250 reqs/sec. 
This is less than HashCache-Log because Tiger’s larger 
index size reduces the amount of hot object cache available, 
reducing its prefetching effectiveness. 

Figure 3: Peak Request Rates for Different policies for 
low end SATA disk. 

Table 7: Expected throughputs (reqs/sec) for policies 
for different disk speeds; all calculations include read 
prefetching 


Figure 4 shows the results from tests conducted 
on HashCache-SetMem and two configurations of 
HashCache-SetMemLRU using 2 disks. The perfor- 
mance of the HashCache-SetMem system scales to 160 
reqs/sec, which is slightly more than double its perfor- 
mance with a single disk. The reason for this difference 
is that the second disk does not have the overhead of han- 
dling all access logging for the entire system. The two 
other graphs in the figure, labeled HC-SML30 and HC- 
SML40, are the 2 versions of HashCache-SetMemLRU 
where only 30% and 40% of all the set headers are 
cached in main memory. As mentioned earlier, the 
hash table and the LRU list overhead of HashCache-SetMemLRU 
is such that when 50% of set headers are 
cached, it takes about the same amount of memory as 
HashCache-SetMem. These experiments serve to 
show that HashCache-SetMemLRU can provide further 
savings when working set sizes are small and one does 
not need all the set headers in main memory at all times 
to perform reasonably well. 

Figure 4: Peak Request Rates for Different SetMemLRU 
policies on low end SATA disks. 

Figure 5: Resource Usage for Different Systems 

These experiments also demonstrate HashCache’s 
small systems footprint. Those measurements are shown 
in Figure 5 for the single-disk experiment. In all cases, 
the disk is the ultimate performance bottleneck, with 
nearly 100% utilization. The user and system CPU re- 
main relatively low, with the higher system CPU lev- 
els tied to configurations with higher request rates. 
The most surprising metric, however, is Squid’s high 
memory usage rate. Given that its storage size was 
only one-third that used by HashCache, it still exceeds 
HashCache’s memory usage in HashCache’s highest- 
performance configuration. In comparison, the lowest- 
performance HashCache configurations, which have per- 
formance comparable to Squid, barely register in terms 
of memory usage. 

Table 8: Performance on a high end system 

Figure 6: Low End Systems Hit Ratios 


Figure 6 shows the cache hit ratio (by object) and the 
byte hit ratios (bandwidth savings) for the HashCache 
policies at their peak request rate. Almost all configu- 
rations achieve the maximum offered hit ratios, with the 
exception of HashCache-Basic, which suffers from hash 
collision effects. 

While the different policies offer different tradeoffs, 
one might observe that the performance jump between 
HashCache-SetMem and HashCache-Log is substantial. 
To bridge this gap one can use multiple small disks in- 
stead of one large disk to increase performance while 
still using the same amount of main memory. These 
experiments further demonstrate that for low-end ma- 
chines, HashCache can not only utilize more disk stor- 
age than commercial cache designs, but can also achieve 
comparable performance while using less memory. The 
larger storage size should translate into greater network 
savings, and the low resource footprint ensures that the 
proxy machine need not be dedicated to just a single 
task. The HashCache-SetMem configuration can be used 
when one wants to index larger disks on a low-end ma- 
chine with a relatively low traffic demand. The lowest- 
footprint configurations, which use no main-memory in- 
dexing, HashCache-Basic and HashCache-Set, would 
even be appropriate for caching in wireless routers or 
other embedded devices. 


5.3 High-End System Experiments 

For our high-end system experiments, we choose hard- 
ware that would be more appropriate in a datacenter. 
The processor is a dual-core 2GHz Xeon, with 2MB of 
L2 cache. The server has 3.5GB of main memory, and 
five 10K RPM Ultra2 SCSI disks, of 18GB each. These 
disks perform 90 to 95 random seeks/sec. Using our analytical 
models, we expect a performance of at least 320 
reqs/sec/disk with HashCache-Log. On this machine we 
run HashCache-Log, Tiger and Squid. From the HashCache 
configurations, we chose only HashCache-Log 
because the ample main memory of this machine would 
dictate that it can be used for better performance rather 
than maximum cache size. 

Figure 7: High End System Performance Statistics 


Figure 7 shows the resource utilization of the three 
systems at their peak request rates. HashCache-Log con- 
sumes just enough memory for hot object caching, write 
buffers and also the index, still leaving about 65% of the 
memory unused. At the maximum request rate, the work- 
load becomes completely disk bound. Since the working 
set size is substantially larger than the main memory size, 
expanding the hot object cache size produces diminish- 
ing returns. Squid fails to reach 100% disk throughput 
simultaneously on all disks. Dynamic load imbalance 
among its disks causes one disk to be the system bottle- 
neck, even though the other four disks are underutilized. 
The load imbalance prevents it from achieving higher re- 
quest rates or higher average disk utilization. 


The performance results from this test are shown in 
Table 8, and they confirm the expectations from the ana- 
lytical models. HashCache-Log and Tiger perform com- 
parably well at 2200-2300 reqs/sec, while Squid reaches 
only 400 reqs/sec. Even at these rates, HashCache-Log
is purely disk-bound, while the CPU and memory con-
sumption has ample room for growth. The per-disk per-
formance of HashCache-Log of 440 reqs/sec/disk is in
line with the best commercial showings: the highest-
performing system at the Fourth Cacheoff achieved less
than an average of 340 reqs/sec/disk on 10K RPM
SCSI disks. The absolute best throughput that we find
from the Fourth Cacheoff results is 625 reqs/sec/disk
on two 15K RPM SCSI disks, and on the same
speed disks HashCache-Log and Tiger both achieve 700
reqs/sec/disk, confirming the comparable performance.

Table 9: Performance on large disks
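
The per-disk figure quoted above follows from dividing the aggregate request rate by the five disks in the server. The following is a minimal sketch of that arithmetic using only numbers from the text; the function and constant names are illustrative and not part of HashCache.

```python
# Per-disk throughput for the high-end test. All inputs come from the text;
# the names below are hypothetical and chosen only for this illustration.
NUM_DISKS = 5                # five 10K RPM SCSI disks
MODEL_FLOOR_PER_DISK = 320   # analytical-model expectation (reqs/sec/disk)


def per_disk_rate(total_reqs_per_sec, num_disks=NUM_DISKS):
    """Aggregate request rate divided evenly across the disks."""
    return total_reqs_per_sec / num_disks


# Lower end of the reported 2200-2300 reqs/sec range for HashCache-Log.
hashcache_log = per_disk_rate(2200)
squid = per_disk_rate(400)

print(hashcache_log, squid)
print(hashcache_log >= MODEL_FLOOR_PER_DISK)
```

Under this even-split assumption, HashCache-Log clears the 320 reqs/sec/disk expected from the analytical model, while Squid falls well below it.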

These tests demonstrate that the same HashCache 
code base can provide good performance on low- 
memory machines while matching or exceeding the per- 
formance of high-end systems designed for cache ap- 
pliances. Furthermore, this performance comes with a 
significant savings in memory, allowing room for larger 
storage or higher performance. 


5.4 Large Disk Experiments 

Our final set of experiments involves using HashCache
configurations with large external storage systems. For
this test, we use two 1TB external hard drives attached to
the server via USB. These drives perform 67-70 random
seeks per second. Using our analytical models, we would
expect a performance of 250 reqs/sec with HashCache-
Log. In other respects, the server is configured compara-
bly to our low-end machine experiment, but the memory
is increased from 256MB to 2GB to accommodate some
of the configurations that have larger index requirements,
representative of low-end servers being deployed [24].

Figure 9: Large Disk System Performance Statistics

We compare the performance of HashCache-SetMem, 
HashCache-Log and HashCache-LogLRU with one or 
two external drives. Since the offered cache hit rate for 
the workload is 60%, we cache 6 out of the 8 log off- 
sets in main memory for HashCache-LogLRU. For these 
experiments, the Disk Table is stored on a disk separate 
from the ones keeping the circular log. Also, since filling 
the 1TB hard drives at 300 reqs/second would take exces- 
sively long, we randomly place 50GB of data across each 
drive to simulate seek-limited behavior. 

Unfortunately, even with 2GB of main memory, Tiger 
and Squid are unable to index these drives, so we were 
unable to test them in any meaningful way. Figure 8 
shows the size of the largest disk that each of the sys- 
tems can index with 2 GB of memory. In the figure, HC- 
SM and HC-L are HashCache-SetMem and HashCache- 
Log, respectively. The other HashCache configurations, 
Basic and Set have no practical limit on the amount of 
externally-attached storage. 

The Polygraph results for these configurations are
shown in Table 9, and the resource usage details are in
Figure 9. With 2TB of external storage, both HashCache-
Log and HashCache-LogLRU are able to perform 600
reqs/sec. In this configuration, HashCache-Log uses
slightly more than 60% of the system’s memory, while
HashCache-LogLRU uses slightly less. The hit time for
HashCache-LogLRU is a little higher than HashCache-
Log because in some cases it requires 2 seeks (one for the
position, and one for the content) in order to perform a
read. The slightly higher cache hit rates exhibited on this
test versus the high-end systems test are due to the Poly-
graph environment: without filling the cache, it has a
smaller set of objects to reference, yielding a higher of-
fered hit ratio.

The 1TB test achieves half the performance of the 2TB 
test, but does so with correspondingly less memory uti- 
lization. The HashCache-SetMem configuration actually 
uses less than 10% of the 2GB overall in this scenario, 
suggesting that it could have run with our original server 
configuration of only 256MB. 

While the performance results are reassuring, these ex-
periments show that HashCache can index disks that are
much larger than conventional policies could handle. At
the same time, HashCache performance meets or exceeds 
what other caches would produce on much smaller disks. 
This scenario is particularly important for the develop- 
ing world, because one can use these inexpensive high- 
capacity drives to host large amounts of content, such 
as a Wikipedia mirror, WAN accelerator chunks, HTTP 
cache, and any other content that can be preloaded or 
shipped on DVDs later. 


6 Related Work 


Web caching in its various forms has been studied ex- 
tensively in the research and commercial communities. 
As mentioned earlier, the Harvest cache [4] and CERN 
caches [17] were the early approaches. The Harvest 
design persisted, especially with its transformation into 
the widely-used Squid Web proxy [35]. Much re- 
search has been performed on Squid, typically aimed 
at reorganizing the filesystem layout to improve perfor- 
mance [16, 18], better caching algorithms [14], or better 
use of peer caches [11]. Given the goals of HashCache, 
efficiently operating with very little memory and large 
storage, we have avoided more complexity in cache re- 
placement policies, since they typically use more mem- 
ory to make the decisions. In the case of working sets that 
dramatically exceed physical memory, cache policies are 
also likely to have little real impact. Disk cache replace- 
ment policies also become less effective when storage 
sizes grow very large. We have also avoided Bloom- 
filter approaches [2] that would require periodic rebuilds, 
since scanning terabyte-sized disks can sap disk perfor- 
mance for long periods. Likewise, approaches that re- 
quire examining multiple disjoint locations [19, 32] are 
also not appropriate for this environment, since any small 
gain in reducing conflict misses would be offset by large 
losses in checking multiple locations on each cache miss.

Some information has been published about commer-
cial caches and workloads in the past, including the
design considerations for high-speed environments [3], 
proxy cache performance in mixed environments [9], 
and workload studies of enterprise user populations [12]. 
While these approaches have clearly been successful in 
the developed world, many of the design techniques have 
not typically transitioned to the more price-sensitive por- 
tions of the design space. We believe that HashCache 
demonstrates that addressing problems specific to the de- 
veloping world can also open interesting research oppor- 
tunities that may apply to systems that are not as price- 
sensitive or resource-constrained. 

In terms of performance optimizations, two previ- 
ous systems have used some form of prefetching, in- 
cluding one commercial system [7], and one research 
project [33]. Based on published metrics, HashCache 
performs comparably to the commercial system, despite 
using a much simpler approach to grouping objects, and
despite using a standard filesystem for storage instead 
of raw disk access. Little scalability information is pre- 
sented on the research system, since it was tested only 
using Apache mod_proxy at 8 requests per second. Oth-
erwise, very little information is publicly available re-
garding how high-performance caches typically oper-
ated during the extremely competitive commercial period
for proxy caches, centered around the year 2000. In 
that year, the Third Cache-Off [40] had a record num- 
ber of vendors participate, representing a variety of dif- 
ferent caching approaches. In terms of performance, 
HashCache-Log compares favorably to all of them, even 
when normalized for hardware. 

Web caches also get used in two other contexts: 
server accelerators and content distribution networks 
(CDNs) [1, 10, 26, 41]. Server accelerators, also known 
as reverse proxies, typically reside in front of a Web 
server and offload cacheable content, allowing the Web 
server to focus on dynamically-generated content. CDNs 
geographically distribute the caches, reducing latency to
the client and bandwidth consumption at the server. In 
these cases, the proxy typically has a very high hit rate, 
and is often configured to serve as much content from 
memory as possible. We believe that HashCache is 
also well-suited for this approach, because in the Set- 
MemLRU configuration, only the index entries for popu- 
lar content need to be kept in memory. By freeing the 
main memory from storing the entire index, the extra 
memory can be used to expand the size of the hot object 
cache. 

Finally, in terms of context in developing world 
projects, HashCache is simply one piece of the infras- 
tructure that can help these environments. Advances in 
wireless network technologies, such as WiMax [42] or 
rural WiFi [27, 36] will help make networking available
to larger numbers of people, and as demand grows, we
believe that the opportunities for caching increase. Given 
the low resource usage of HashCache and its suitability 
for operation on shared hardware, we believe it is well- 
suited to take advantage of networking advancements in 
these communities. 


7 Deployments 


HashCache is currently deployed at two different lo- 
cations in Africa, at the Obafemi Awolowo University 
(OAU) in Nigeria and at the Kokrobitey Institute (KI) 
in Ghana. At OAU, it runs on their university server 
which has a 100 GB hard drive, 2 GB memory and a dual-
core Xeon processor. For Internet connection, they pay
$5,000 per month for a 2 Mbps satellite link to an ISP in
Europe and the link has a high-variance ICMP ping time
from Princeton, ranging from 500 to 1200 ms. We installed
HashCache-Log on the machine but were asked to limit
resource usage for HashCache to 50 GB disk space and
no more than 300 MB of physical memory. The server
is running other services such as an e-mail service and a
firewall for the department and it is also used for general
computation for the students. Due to privacy issues we 
were not able to analyze the logs from this deployment 
but the administrator has described the system as useful 
and also noticed the significant memory and CPU usage 
reduction when compared to Squid. 

At KI, HashCache runs on a wireless router for a small 
department on a 2 Mbps LAN. The Internet connection 
is through a 256 Kbps submarine link to Europe and the
link has a ping latency ranging from 200 to 500 ms. The 
router has a 30 GB disk and 128 MB of main memory 
and we were asked to use 20 GB of disk space and as 
little memory as possible. This prompted us to use the 
HashCache-Set policy as there are only 25 to 40 people 
using the router every day. Logging is disabled on this 
machine as well since we were asked not to consume 
network bandwidth on transferring the logs. 

In both these deployments we have used HashCache 
policies to improve the Web performance while consum-
ing a minimal amount of resources. Other solutions like
Squid would not have been able to meet these resource 
constraints while providing any reasonable service. Peo- 
ple at both places told us that the idea of a faster Internet 
to popular Web sites seemed like a distant dream until we 
discussed the complete capabilities of HashCache. We 
are currently working with OLPC to deploy HashCache 
at more locations with the OLPC XS servers. 


8 Conclusion and Future Work


In this paper we have presented HashCache, a high-
performance configurable cache storage for the devel-
oping regions. HashCache provides a range of config-
urations that scale from using no memory for indexing
to ones that require only one-tenth as much as current
high-performance approaches. It provides this flexibil-
ity without sacrificing performance: its lowest-resource
configuration has performance comparable to free soft- 
ware systems, while its high-end performance is compa- 
rable to the best commercial systems. These configura- 
tions allow memory consumption and performance to be 
tailored to application needs, and break the link between 
storage size and in-memory index size that has been com- 
monly used in caching systems for the past decade. The 
benefits of HashCache’s low resource consumption al- 
low it to share hardware with other applications, share 
the filesystem, and to scale to storage sizes well beyond 
what present approaches provide. 

On top of the HashCache storage layer, we have built 
a Web caching proxy, the HashCache Proxy, which can 
run using any of the HashCache configurations. Us- 
ing industry-standard benchmarks and a range of hard- 
ware configurations, we have shown that HashCache per- 
forms competitively with existing systems across a range 
of workloads. This approach provides an economy of 
scale in HashCache deployments, allowing it to be pow- 
ered from laptops, low-resource desktops, and even high- 
resource servers. In all cases, HashCache either performs 
competitively or outperforms other systems suited to that 
class of hardware. 

With its operational flexibility and a range of available
performance options, HashCache is well suited to pro-
viding the infrastructure for caching applications in de-
veloping regions. Not only does it provide competitive
performance under stringent resource constraints, but it
also enables new opportunities that were not possible
with existing approaches. We believe that HashCache
can become the basis for a number of network caching 
services, and are actively working toward this goal.


iPlane Nano: Path Prediction for Peer-to-Peer Applications

Abstract


Many peer-to-peer distributed applications can benefit 
from accurate predictions of Internet path performance. 
Existing approaches either 1) achieve high accuracy for 
sophisticated path properties, but adopt an unscalable 
centralized approach, or 2) are lightweight and decentral- 
ized, but work only for latency prediction. 

In this paper, we present the design and implementa- 
tion of iPlane Nano, a library for delivering Internet path 
information to peer-to-peer applications. iPlane Nano 
is itself a peer-to-peer application, and scales to a large 
number of end hosts with little centralized infrastructure 
and with a low cost of participation. The key enabling 
idea underlying iPlane Nano is a compact model of Inter- 
net routing. Our model can accurately predict end-to-end 
PoP-level paths, latencies, and loss rates between arbi- 
trary hosts on the Internet, with 70% of AS paths pre- 
dicted exactly in our evaluation set. Yet our model can 
be stored in less than 7MB and updated with approxi- 
mately 1MB/day. Our evaluation of iPlane Nano shows 
that it can provide significant performance improvements 
for large-scale applications. For example, iPlane Nano 
yields near-optimal download performance for both small 
and large files in a P2P content delivery system. 


1 Introduction 


Peer-to-peer (P2P) systems offer a number of potential 
advantages to the network systems designer, such as scal- 
ability, resilience, and perhaps most importantly, cost- 
effectiveness: P2P systems require little or no fixed in- 
frastructure, and yet can scale to millions of end hosts. 
These advantages have provoked considerable interest 
in the P2P design paradigm among researchers [10, 14, 
44]. There have also been several widespread deploy- 
ments, including BitTorrent file sharing [11], Skype’s 
use of detour routing for voice over IP [52], and multi- 
player game servers that reduce bandwidth costs by us- 
ing well-provisioned players to distribute objects to other 
peers [4]. 

In this paper, we argue that a key missing piece of in-
frastructure for P2P applications is scalable and inexpen-
sive access to accurate information about Internet paths.
P2P applications by their nature select among a large
number of alternative paths; more accurate information
can help streamline that search process. For example, a
P2P content distribution network [45, 38, 25] might bene- 
fit from directing requests to a replica with a low latency, 
low loss path. Similarly, an IP layer detour routing ser- 
vice would benefit from structural information about the 
Internet, to quickly find a path around a network fail- 
ure [59, 23]. 


While server-based solutions for providing timely in- 
formation about the Internet have been proposed and built 
in the past [30, 1], they are less appropriate in the P2P 
case. The iPlane [30] query engine, for example, runs as 
a service, but since its algorithms require multi-gigabyte 
memory resident data structures to generate predictions, 
it would be difficult and costly to scale, especially for 
a popular P2P application with millions of end hosts. 
iPlane’s memory footprint means it cannot even run on 
PlanetLab [41]. Further, iPlane’s data cannot be easily 
distributed given its size and running the service on a few 
nodes in turn significantly limits the rate at which queries 
can be served. While network coordinate systems [13] do
scale, they only predict latency, and not the full range of 
topology-aware performance metrics needed by P2P ap- 
plications. 


To address this gap, we have designed and built a 
system called iPlane Nano, or iNano. iNano uses the
same input data and provides the same query interface
as iPlane, but is designed as a lightweight library that can
run on client machines, and even on small devices such 
as Internet-capable smart phones. To make this work, we 
have developed a compact model of Internet topology, 
routing policy, and link performance metrics that can be 
represented in less than 7MB, and updated with approxi- 
mately 1MB/day. Yet this model is rich enough to be able 
to accurately predict end to end routes, latencies, and loss 
rates between arbitrary end hosts on the Internet. In our 
evaluation, we find that iNano predicts 70% of AS paths 
exactly, estimates latencies with less than 20ms of error 
for over 60% of paths, and estimates loss rates with less 
than 10% error for over 80% of paths. 


Because iNano’s data set is the same for all end hosts, 
both the model and its incremental daily updates can be 
efficiently distributed using standard file sharing tech- 
niques, such as via BitTorrent swarms. Our evaluation 
shows that although our predictions are based on only a 
tiny fraction of the total information available about the 
Internet, iNano can significantly improve application per-
formance. For example, iNano yields near-optimal me-
dian download performance for both small and large files
in a P2P content delivery system. 

In summary, our primary contribution is to develop 
an accurate yet lightweight approach for Internet perfor- 
mance prediction. To this end, we develop: 


• A pocket-sized, annotated link-level map of the Inter-
net, that can be represented in 7MB and updated daily
with 1MB of data.


• Techniques to infer and concisely represent informa-
tion stored in the forwarding tables of Internet routers,
but in orders of magnitude less space.


• Implementation of iNano, a system that enables
Internet-scale P2P applications to discover properties
of Internet paths.


• Case studies using CDNs, VoIP, and detour routing to
demonstrate the utility of iNano.


2 Motivation and Design Goals 
2.1 Goals 


iNano targets network applications that choose among 
multiple candidate paths to improve data transfer perfor- 
mance. The design goals of iNano and their motivations 
are as follows. 

Rich path metrics: iNano should enable distributed 
applications to orchestrate their actions based on sophis- 
ticated path information. Application-perceived path per- 
formance may depend on one or more path metrics such 
as latency, loss rate, or bottleneck capacity. For exam- 
ple, TCP performance depends upon the latency as well 
as loss rate along the path, so a CDN re-director or Bit- 
Torrent tracker may wish to use both metrics in its deci- 
sions. A VoIP server such as in Skype may wish to pick 
a relay node according to the mean-opinion-score (MOS) 
metric [5] that depends upon loss rate and latency. Live 
video streaming systems [3, 2, 10] that set up an over- 
lay network among participating end-hosts may wish to 
incorporate path metrics such as latency, loss rate, and 
bottleneck capacity in the construction of the overlay. A 
combination of these metrics determines the quality of 
the video a client receives as well as its initial buffering 
delay. 
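
To illustrate why a redirector may want both latency and loss rate, the sketch below ranks candidate replicas by an estimated TCP throughput. It uses the well-known Mathis et al. approximation, rate ≈ (MSS/RTT) · C/√p with C = √(3/2), which is not a formula taken from this paper; the replica names, the metric values, and the function names are hypothetical.

```python
import math


def mathis_throughput_bps(rtt_s, loss_rate, mss_bytes=1460):
    """Approximate steady-state TCP throughput in bits/sec.

    Mathis et al. approximation: (MSS / RTT) * C / sqrt(p), C = sqrt(3/2).
    Only meaningful for a strictly positive RTT and loss rate.
    """
    if rtt_s <= 0 or not (0 < loss_rate < 1):
        raise ValueError("need rtt_s > 0 and 0 < loss_rate < 1")
    c = math.sqrt(1.5)
    return (mss_bytes * 8 / rtt_s) * c / math.sqrt(loss_rate)


def best_replica(candidates):
    """Pick the candidate with the highest estimated throughput.

    candidates maps a replica name to (rtt_seconds, loss_rate).
    """
    return max(candidates, key=lambda name: mathis_throughput_bps(*candidates[name]))


# Hypothetical predictions for two replicas.
replicas = {
    "replica_a": (0.020, 0.02),   # lower latency, lossy path
    "replica_b": (0.050, 0.001),  # higher latency, clean path
}

choice = best_replica(replicas)
lowest_latency = min(replicas, key=lambda name: replicas[name][0])
print(choice, lowest_latency)
```

In this made-up example a latency-only choice picks replica_a, while the throughput estimate prefers replica_b, which is the kind of decision that needs more than one path metric.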

Scalable lookup: iNano should scale to every end- 
host in the Internet. The trend towards massively dis- 
tributed applications such as CDNs, BitTorrent, and 
Skype suggests that the potential demand for path per- 
formance prediction requests may be comparable to DNS 
or web search. Given the frequent occurrence of detour 
routes [48, 29], it is conceivable that every transfer is pre- 
ceded by a query about alternative paths to the destina- 
tion. Furthermore, the lookups must be local to be ef- 
fective; otherwise, the delay incurred may outweigh the 
resultant improvement in data transfer performance. 

NSDI ’09: 6th USENIX Symposium on Networked Systems Design and Implementation

Low infrastructure cost: iNano should incur a low
infrastructure cost to set up and maintain. A server-based
infrastructure will need to be continually provisioned as 
demand increases and will incur significant cost to de- 
ploy and maintain. Instead, iNano should leverage the 
property of P2P applications—users not only create de- 
mand but also contribute resources to the system—by 
using computing cycles and bandwidth on participating 
end-hosts rather than on dedicated servers. 

Structural information: iNano should enable net- 
work applications to base their decisions on the structure 
of the path. For example, recent proposals have advo- 
cated locality-aware peer selection in peer-to-peer sys- 
tems by either choosing paths that minimize the AS path 
length [9] or by jointly optimizing network cost and ap- 
plication performance [57]. Knowing the route can also 
enable applications to perform detour [48, 7] or multipath 
routing [58, 24] for reliability or performance objectives. 
Structural information can also be used to route around 
network failures [59, 23]. 

Arbitrary end-hosts: iNano should enable an appli- 
cation to infer path information between an arbitrary pair 
of end-hosts, not just from itself to others. Many of 
the examples above involving redirection in peer-to-peer 
content distribution, VoIP relays, multicast overlay con- 
struction, and detour routing require this capability. Fur- 
thermore, iNano should provide forward as well as re- 
verse path information between arbitrary end-hosts—a 
goal that is challenging even for paths originating locally 
because of the asymmetric nature of Internet routing. 


2.2 Exploring design alternatives 


Why can’t existing techniques achieve the above goals? 
To appreciate the challenge, let us consider a few natural 
design alternatives as shown in Table 1. 

A1 is the well-studied network coordinates approach
to infer latencies between end-hosts without on-demand 
measurement. In this approach, each end-host is assigned 
a coordinate, typically in a metric space, and the latency 
between two end-hosts is estimated as the distance be- 
tween their coordinates. Distributed systems such as Vi- 
valdi [13] implement the coordinate approach in a scal- 
able manner. However, the only information they provide 
to an application running on an end-host is the latency on 
paths from that end-host to the rest of the Internet. Al-
though the coordinate system could potentially be mod-
ified to predict latencies between arbitrary end-hosts by
periodically disseminating a coordinate for every Inter- 
net prefix, it is unclear how to extend this approach to 
other path metrics such as loss rate or bottleneck capac- 
ity. Also, since coordinate systems rely only on end-to- 
end measurements, they do not provide information on 
the route traversed by a path. 
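
The coordinate approach in A1 can be sketched in a few lines. This is a minimal illustration that assumes a plain two-dimensional Euclidean space with distances read directly as milliseconds; real systems such as Vivaldi choose their own space and fitting procedure, and the host names and coordinates here are made up.

```python
import math


def predicted_latency_ms(coord_a, coord_b):
    """Estimate latency as the Euclidean distance between two coordinates."""
    return math.dist(coord_a, coord_b)


# Hypothetical coordinates, one per end-host.
coords = {
    "host_a": (0.0, 0.0),
    "host_b": (30.0, 40.0),
    "host_c": (6.0, 8.0),
}

a_to_b = predicted_latency_ms(coords["host_a"], coords["host_b"])
a_to_c = predicted_latency_ms(coords["host_a"], coords["host_c"])
print(a_to_b, a_to_c)
```

The estimate is a single number per pair of hosts, which is why this approach says nothing about loss rate, capacity, or the route itself.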

Table 1: Qualitative comparison of design alternatives for Internet path performance prediction.

A2 is an approach where applications issue queries
about path performance to a network information ser-
vice hosted on centralized or replicated query servers.
This approach is suggested and made plausible by prior 
work, namely iPlane, that developed techniques to accu- 
rately predict the path and path metrics between an arbi- 
trary pair of end-hosts. However, scaling replicated query 
servers to handle requests from all end-hosts—a work- 
load comparable to DNS—is challenging and would in- 
cur a huge infrastructure cost to set up and maintain. The 
number of query servers provisioned will need to grow 
in proportion to the number of end-hosts issuing queries, 
making this approach impractical for typical P2P appli- 
cations. 


A3 replicates a query server on each end-host. This
approach, dubbed the “network newspaper” in [30], would
disseminate an atlas of measured Internet paths to end- 
hosts to enable them to locally service their queries. The 
atlas can be refreshed daily by sending incremental up- 
dates; since most Internet paths do not change over a day 
[40], daily updates are expected to be small. Unfortu- 
nately, iPlane’s atlas of paths is several gigabytes in size, 
making this approach unlikely to be adopted in practice. 
An alternative is to delegate this task to a local agent (like 
a local DNS nameserver) in each subnet, but the boot- 
strapping overhead would pose a barrier to widespread 
deployment and use. Another alternative is for each client 
to only download its “view” of the network, i.e., proper-
ties of paths originating at itself, but this approach does
not allow an end-host to predict properties of paths be- 
tween arbitrary end-hosts, e.g., as required to enable de- 
tour routing. 


A4, where each end-host conducts its own measure- 
ments as needed, also suffers from the problem of not 
being able to predict properties of paths between arbi- 
trary end-hosts. Furthermore, such uncoordinated mea- 
surements might impose an unreasonable measurement 
overhead, e.g., measurement of loss rates and bandwidth
capacities requires many large-sized packet probes to be
sent into the network. A centralized coordinator and ag- 
gregator of measurements like iPlane amortizes this over- 
head, but makes dissemination a challenge as discussed in 
A2 and A3. 


3 iNano Design 


Our system, iNano, combines the best of the above alter- 
natives. For scalability, iNano replicates query servers at 
each end-host. To predict rich path metrics, iNano uses 
a structural technique like iPlane that predicts the PoP-
level¹ path between an arbitrary pair of end-hosts. How-
ever, the data required to make such predictions needs to
be compact, like coordinates or like the AS-level Internet 
graph, unlike a huge atlas of measured paths. 

The key insight in iNano is a novel model for predict- 
ing paths and their properties between arbitrary end-hosts 
using a compact Internet atlas. iPlane uses a path compo-
sition technique to perform path predictions. To predict
the path from a source to a destination, the path compo- 
sition technique composes two path segments that inter- 
sect with each other. The first segment is from a path out 
from the source to an arbitrary destination. The second 
segment is from a path measured from one of iPlane’s 
vantage points to the destination’s prefix. Depending on 
which intersecting pair of segments is chosen, the path 
obtained by composition is often similar to the actual 
route from source to destination. 
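
The path composition idea described above can be sketched as follows. This is an illustration only: paths are modeled as lists of node names, the splice point is taken to be the first node on the source's path that also appears on the vantage-point path, and the shortest spliced candidate is returned. Those selection rules and all names are assumptions, not iPlane's actual procedure.

```python
def compose_paths(source_paths, vantage_paths):
    """Splice each source path with each vantage-point path that it intersects.

    source_paths: paths measured out from the source to arbitrary destinations.
    vantage_paths: paths measured from vantage points to the destination's prefix.
    """
    candidates = []
    for src_path in source_paths:
        for vp_path in vantage_paths:
            index_in_vp = {node: i for i, node in enumerate(vp_path)}
            for i, node in enumerate(src_path):
                if node in index_in_vp:
                    # First segment: source up to the intersection.
                    # Second segment: intersection onward to the destination.
                    candidates.append(src_path[:i] + vp_path[index_in_vp[node]:])
                    break
    return candidates


def predict_path(source_paths, vantage_paths):
    """Return the shortest spliced candidate, or None if nothing intersects."""
    candidates = compose_paths(source_paths, vantage_paths)
    return min(candidates, key=len) if candidates else None


# Hypothetical measurements.
source_paths = [["src", "a", "b", "x"]]
vantage_paths = [["vp1", "c", "b", "d", "dst"], ["vp2", "e", "f", "dst"]]

predicted = predict_path(source_paths, vantage_paths)
print(predicted)
```

Only the first vantage-point path shares a node with the source's path, so the prediction is spliced at node b.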

Instead of using an atlas of measured paths like 
iPlane’s, iNano uses an atlas of measured links. The 
space required by the former representation is propor- 
tional to the number of vantage points while the lat- 
ter representation requires space linear in the number of 
nodes and edges in the underlying Internet graph. Conse- 
quently, iNano’s atlas fits in less than 7MB, almost three 
orders of magnitude smaller than iPlane’s atlas, enabling 
it to be distributed to lightly powered end-hosts. The key 
challenge in making this approach work is to make accu- 
rate predictions about Internet path performance from an 
atlas of observed links. 
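
A small sketch shows why a link atlas is more compact than a path atlas: links shared by many measured paths are stored once. The paths below are made up, and links are treated as directed pairs of adjacent nodes, which is an assumption made for illustration.

```python
def links_from_paths(paths):
    """Collapse measured paths into the set of distinct directed links."""
    links = set()
    for path in paths:
        links.update(zip(path, path[1:]))
    return links


# Hypothetical paths from three vantage points that share a core.
paths = [
    ["v1", "a", "b", "d1"],
    ["v2", "a", "b", "d1"],
    ["v3", "a", "b", "d2"],
]

link_atlas = links_from_paths(paths)
hops_in_path_atlas = sum(len(p) - 1 for p in paths)
print(len(link_atlas), hops_in_path_atlas)
```

Adding another vantage point whose paths cross the same core adds only its unshared edge links to the link atlas, whereas the path atlas grows by a full path per destination.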

iNano’s approach of distributing a compact atlas and
locally resolving queries at end-hosts avoids significant
investment in server infrastructure. The approach also of-
floads the bandwidth cost of disseminating the atlas and
its periodic updates; the atlas can be swarmed among
end-hosts using, for example, BitTorrent. The genera-
tion of the atlas itself is the only centralized component
in iNano. A central coordinator distributes the task of
issuing measurements to participating end-hosts and ag-
gregates the measured paths into a set of measured links.

¹A Point-of-Presence (PoP) of an AS is the set of routers in that AS
in the same location.

iNano’s current measurement infrastructure is largely 
the same as that of iPlane [30] but processes the mea- 
surements in a completely different manner to make path 
performance predictions in keeping with the goals stated 
in Section 2. Although we use end-host measurements 
in building the atlas, we use as a starting point tracer- 
outes from PlanetLab [41] to destinations in 140K pre- 
fixes, which include roughly 90% of prefixes at the Inter- 
net’s edge. The interfaces discovered in the traceroutes 
are clustered together such that interfaces in the same 
Point of Presence (PoP) within an AS are in the same 
cluster; routers in the same PoP within an AS are similar 
from a routing perspective. To map the IP address of an 
interface to its corresponding AS, iNano uses the map- 
ping from prefixes to their origin ASes as seen in BGP 
feeds [33] and also resolves aliases [53] to ensure dif- 
ferent interfaces on the same router are mapped to the 
same AS. The clustering of interfaces in each AS into 
PoPs is performed using a combination of alias resolu- 
tion, mapping DNS names to locations [55], and identi- 
fying colocated interfaces based on similarity in reverse 
path lengths. 
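
The prefix-to-origin-AS mapping step can be illustrated with Python's standard ipaddress module. The sketch assumes a longest-prefix match over a small table; the table contents use documentation prefixes and documentation AS numbers and are hypothetical, and a linear scan is used only for brevity.

```python
import ipaddress


def origin_as(ip, prefix_to_as):
    """Return the origin AS of the most specific prefix covering ip, or None."""
    addr = ipaddress.ip_address(ip)
    best = None
    for prefix, asn in prefix_to_as.items():
        net = ipaddress.ip_network(prefix)
        if addr in net and (best is None or net.prefixlen > best[0].prefixlen):
            best = (net, asn)
    return None if best is None else best[1]


# Hypothetical prefix table of the kind derived from BGP feeds.
prefix_table = {
    "192.0.2.0/24": 64500,
    "192.0.2.128/25": 64501,
    "198.51.100.0/24": 64502,
}

print(origin_as("192.0.2.200", prefix_table))
print(origin_as("192.0.2.5", prefix_table))
print(origin_as("203.0.113.1", prefix_table))
```

A production implementation would likely use a trie rather than scanning every prefix, and would still need the alias resolution step described above so that all interfaces of one router map to the same AS.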

iNano processes the gathered traceroutes in combina- 
tion with the PoP clustering information to build an at- 
las of inter-cluster links. To annotate links in this atlas 
with performance metrics, iNano performs measurements 
to infer the latencies and loss rates of inter-cluster links. 
iNano uses the frontier search algorithm described in [30] 
to partition the set of links across the PlanetLab vantage 
points, with some redundancy to account for measure- 
ment noise. Each node then attempts to measure the la- 
tency and loss rates of links assigned to it. The tech- 
nique for measuring loss rates is the same as that used 
by iPlane. Measuring latencies of links is hard due to the 
wide prevalence of asymmetric routing [40, 21]. iNano 
tackles this challenge using a two-pronged approach— 
first, by identifying symmetric paths, and second, by 
leveraging measurements of symmetric paths to measure 
latencies of other links that do not appear on symmetric 
routes. iNano’s link latency measurement techniques are 
described in [28]. To estimate the end-to-end latency and 
loss rate between a source and destination, iNano predicts 
the forward and reverse paths between these end-hosts 
and composes the properties of the inter-cluster links on 
the predicted paths. 
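
One plausible way to compose link properties along predicted paths is sketched below. It assumes that link latencies add and that losses on different links are independent, so delivery probabilities multiply; the text does not spell out the exact composition rule, and the paths, link annotations, and function names are hypothetical.

```python
import math


def path_latency_ms(path, link_latency_ms):
    """Sum the latencies of consecutive links on the path."""
    return sum(link_latency_ms[link] for link in zip(path, path[1:]))


def path_loss_rate(path, link_loss):
    """Loss rate of the path, assuming independent per-link losses."""
    delivered = math.prod(1.0 - link_loss[link] for link in zip(path, path[1:]))
    return 1.0 - delivered


def round_trip_estimate(forward, reverse, link_latency_ms, link_loss):
    """Combine forward and reverse path predictions into (rtt_ms, loss_rate)."""
    rtt = path_latency_ms(forward, link_latency_ms) + path_latency_ms(reverse, link_latency_ms)
    delivered = (1.0 - path_loss_rate(forward, link_loss)) * (1.0 - path_loss_rate(reverse, link_loss))
    return rtt, 1.0 - delivered


# Hypothetical annotated inter-cluster links; forward and reverse paths differ.
latency = {("s", "p1"): 10, ("p1", "d"): 15, ("d", "p2"): 12, ("p2", "s"): 11}
loss = {("s", "p1"): 0.0, ("p1", "d"): 0.1, ("d", "p2"): 0.0, ("p2", "s"): 0.1}

rtt_ms, loss_rate = round_trip_estimate(["s", "p1", "d"], ["d", "p2", "s"], latency, loss)
print(rtt_ms, round(loss_rate, 4))
```

Because the forward and reverse routes use different links here, the estimate depends on predicting both directions, which is the point made in the text about asymmetric routing.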


4 Route Prediction 


In this section, we develop an inference algorithm that 
predicts routes by composing observed links between 
routers. The set of observed links yields a graph capturing
the Internet’s physical topology. In order to predict an
end-to-end route accurately, we need to compactly
model the routing decisions made by routers along can- 
didate paths in this graph. 

This inference and modeling problem is not easy. In- 
ferring routes would be easy using a naive model that ex- 
plicitly stores the information contained in the forwarding 
tables of routers in the graph. However, that defeats our 
primary goal of predicting routes using a compact graph 
representation. Thus, the key challenge to developing a 
compact model is to understand and describe the proce- 
dure routers use to compute routes, i.e., to concisely de- 
scribe how Internet routing works! 


4.1 The Problem: Modeling Internet Routing 


Compactly modeling Internet routing would be trivial if 
routers simply used shortest path routing. The weights 
used for shortest path computation could be inferred us- 
ing existing approaches [31]. However, Internet route se- 
lection is driven by a number of factors such as routing 
policies driven by economic considerations, traffic engi- 
neering driven by load balancing goals, and performance 
considerations that cannot be characterized as shortest
path routing. Furthermore, end-to-end Internet routes are 
computed by a set of complex interacting protocols (such 
as BGP, OSPF, and RIP) rather than a single protocol. 

Fortunately, we are aided by a large body of prior 
research on understanding and reverse-engineering the 
routing decision process, as well as the knowledge the 
research community has acquired on how Internet rout- 
ing works in practice. These result in the following com- 
monly accepted “textbook” principles about how Internet 
routing works. 


1. Policy preference: ASes use local preferences to se- 
lect routes. Typically, an AS prefers routes through 
its customers over those through its peers, and either 
of those over routes through its providers *. Further, 
ASes do not export all of their paths to their neigh- 
bors; for instance, ASes do not export paths through 
their peers to other peers/providers. Commonly 
used export policies and AS preferences are be- 
lieved to result in valley-free Internet routes [19], in 
which any path that traverses a provider-to-customer 
edge or a peer-to-peer edge does not later traverse a 
customer-to-provider or peer-to-peer edge. 


2. Shortest AS path: After applying local preferences, 
if a router has multiple candidate paths that it prefers 
equally, the default is to select the route containing 
the fewest ASes. Typically, several paths may have 
the same local preference and AS path length. 


*Customer ASes pay their providers while peers connect to each
other at no cost.

P(w,d) = v.P(w, d);
until N = N’

Figure 1: The algorithm used by GRAPH to predict a
valley-free route from s to d in a graph G. ⊕ is the operator
that defines how edge weights compose in our application of
Dijkstra’s shortest path algorithm.


3. Exit policies: Among these, routes are chosen so as 
to meet intradomain objectives, e.g. by choosing the 
nearest exit point to the next AS (referred to as early- 
exit or hot potato routing) along the path. In some 
cases that often involve explicit compensation or ne- 
gotiation among adjacent ASes to reduce their com- 
bined costs, ASes adopt a late-exit policy. 
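
The valley-free property stated in the first principle can be checked mechanically. The following is a minimal sketch, assuming each AS-level edge on a path has already been labeled as customer-to-provider, peer-to-peer, or provider-to-customer; the labels and the function name are hypothetical.

```python
C2P, P2P, P2C = "c2p", "p2p", "p2c"

def is_valley_free(edge_types):
    """Check a path given the relationship type of each consecutive AS edge.

    Once the path has crossed a provider-to-customer or peer-to-peer edge,
    it may not cross a customer-to-provider or peer-to-peer edge again.
    """
    descending = False
    for edge in edge_types:
        if descending and edge in (C2P, P2P):
            return False
        if edge in (P2C, P2P):
            descending = True
    return True
```

For example, a path that climbs two providers, crosses one peering link, and then descends is accepted, while a path that descends to a customer and then climbs again is rejected.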


How well does the above procedure describe Internet 
routing? To evaluate this, we develop a simple algorithm 
based on dynamic programming that underlies various 
forms of shortest path computation. The algorithm in- 
corporates the above criteria to compute an on-demand 
route, based on a graph representation of the Internet. 

Our first attempt, GRAPH, reduces the representation 
size by over two orders of magnitude, but has poor predic- 
tion accuracy. This suggests that exceptions to the above 
criteria are common and must be carefully integrated into 
the model, as we describe in Sections 4.3.1—4.3.4. 


4.2 GRAPH: A first cut 


We present the algorithm in three steps. First, we describe 
a basic algorithm using dynamic programming (similar to 
Dijkstra’s shortest path algorithm) that captures the pref- 
erence for short AS paths, assuming early-exit between 
every pair of ASes. Second, we augment the algorithm 
to model late-exit when necessary. Third, we augment 
the algorithm to model common export policies and local 
preferences for routes. 


4.2.1 Basic algorithm 


Figure 1 shows the pseudocode for GRAPH, an algorithm
that predicts the route between a source s and a
destination d. It chooses the shortest AS path among all
valley-free paths between s and d; further, it uses
early-exit at every AS. The algorithm is similar to
Dijkstra’s shortest path algorithm. Unlike conventional
Dijkstra, however, the route computation 1) backtracks from
the destination to all sources, and 2) uses a two-tuple
cost metric.

The cost of a route from each node v to the destination d,
represented as D(v, d), is a strictly ordered two-tuple
[number of AS hops to the destination, cost to exit the
current AS], with the first component considered as the
more significant value. For two adjacent nodes v and w
connected by a link of latency l(v, w), the cost of the
edge between them, represented as c(v, w), is defined as
[0, l(v, w)] if v and w are in the same AS, and as [1, 0]
otherwise. The ⊕ operator in the algorithm resets the
second component to 0 upon crossing an AS boundary as
follows. If v and w belong to the same AS,
D(w, d) ⊕ c(v, w) is defined as D(w, d) + [0, l(v, w)],
where ‘+’ does the usual component-wise addition. If v and
w belong to adjacent ASes, D(w, d) ⊕ c(v, w) is defined as
[D(w, d)[1] + 1, 0]. It is straightforward to verify that
this definition of cost preserves the invariant that if a
node u ∈ N’, then P(u, d) is a shortest path from u to d.
As in Dijkstra’s algorithm, this invariant ensures the
correctness of the algorithm.
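
The basic algorithm can be sketched as a Dijkstra-style search that starts at the destination and orders nodes by the two-tuple cost. This is an illustrative reconstruction from the description above, not the pseudocode of Figure 1: it omits the valley-free and late-exit extensions, and the names `compose`, `predict_routes`, `neighbors`, and `as_of` are hypothetical.

```python
import heapq

def compose(cost_w, same_as, latency):
    """The composition operator: extend w's cost across edge (v, w) to get v's cost."""
    hops, exit_cost = cost_w
    if same_as:
        return (hops, exit_cost + latency)  # intra-AS: accumulate cost to exit
    return (hops + 1, 0.0)                  # AS boundary: one more hop, reset exit cost

def predict_routes(neighbors, as_of, dst):
    """Backtrack from dst; neighbors[w] lists (v, latency) for nodes v adjacent to w."""
    cost = {dst: (0, 0.0)}
    path = {dst: [dst]}
    done = set()                            # plays the role of N' in the text
    heap = [((0, 0.0), dst)]
    while heap:
        c, w = heapq.heappop(heap)
        if w in done:
            continue                        # stale heap entry
        done.add(w)
        for v, latency in neighbors.get(w, []):
            if v in done:
                continue
            candidate = compose(c, as_of[v] == as_of[w], latency)
            if v not in cost or candidate < cost[v]:
                cost[v] = candidate         # tuples compare lexicographically
                path[v] = [v] + path[w]
                heapq.heappush(heap, (candidate, v))
    return cost, path
```

Because Python compares tuples lexicographically, the AS hop count is automatically the more significant component, and resetting the second component at an AS boundary yields early-exit behavior within each AS.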


4.2.2 Incorporating late-exit 


It is straightforward to extend the above algorithm to han- 
dle pairs of ASes that use late-exit instead of early-exit. 
We model late-exit as two adjacent ASes v,w (such as 
AS6380 and AS6389 — both of which are owned by Bell 
South) jointly computing the path through them in or- 
der to minimize the overall transit latency. To infer late 
exit, we use the technique proposed in [54]. We simply
redefine the ⊕ operator in the following way. An
inter-AS edge (v,w) corresponding to a late-exit route 
has c(v, w) = [0,/(v, w)], meaning that it is treated as an 
intra-AS edge. We do however have to increment the AS 
hop count by two when we backtrack out of the AS con- 
taining v. This is accomplished by maintaining another 
component in the cost tuple that corresponds to the num- 
ber of consecutive late-exit transitions. This component 
corresponds to the number of AS hops that are not yet ac- 
counted for in the AS path length component of the cost 
metric. Whenever an AS transition is traversed where late 
exit is not applied, this third component is added into the 
AS path length component and reset to zero. 
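
The extended cost tuple might be sketched as follows. This is only one reading of the description above: the edge labels, the `Cost` type, and the function name are hypothetical, and the text does not say how three-component costs are ordered, so the sketch covers the composition step only.

```python
from typing import NamedTuple

class Cost(NamedTuple):
    as_hops: int      # AS hops already accounted for
    exit_cost: float  # latency accumulated toward the current exit
    pending: int      # consecutive late-exit transitions not yet counted

def compose_late_exit(cost_w, kind, latency):
    """Extend cost_w across an edge of the given kind (hypothetical labels)."""
    if kind == "intra":
        return Cost(cost_w.as_hops, cost_w.exit_cost + latency, cost_w.pending)
    if kind == "late_exit":
        # Treated like an intra-AS edge, but remember the uncounted AS hop.
        return Cost(cost_w.as_hops, cost_w.exit_cost + latency, cost_w.pending + 1)
    if kind == "inter":
        # Ordinary AS boundary: fold pending hops into the hop count and reset.
        return Cost(cost_w.as_hops + cost_w.pending + 1, 0.0, 0)
    raise ValueError(f"unknown edge kind: {kind}")
```

With one pending late-exit transition, the next ordinary AS boundary raises the hop count by two, matching the behavior described in the text.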





4.2.3 Incorporating export policies


Next, we incorporate constraints corresponding to commonly
used export policies. We infer AS relationships, such as
which ASes are peers and which have a paid
customer/provider transit relationship, using a
combination of CAIDA’s inferences [16] and Gao’s technique
[19]. We model the default export policy in which an AS
advertises any paths through customer ASes to all its
neighbors, and it exports paths from peers and providers
to only its customers. It is well-known that this export
policy leads to valley-free routes.

Figure 2: Route prediction from S to D so as to satisfy
customer<peer<provider preferences. Dark nodes are down
nodes, and light nodes are up nodes. Bold lines go from
customers to their providers, dashed lines connect peers,
and faded dotted lines go from providers to their
customers. GRAPH traverses all the customer-to-provider
edges in the first phase to finalize routes from 3 and 4
to D. Only peering links are traversed in the second phase
making 2 choose a path through 3 over a shorter one via 4.
Finally, provider-to-customer edges are traversed.


To compute valley-free routes, instead of having a single
node for each cluster (PoP) i, we introduce two nodes in
the graph: an up node up_i and a down node down_i, and
GRAPH computes the path from up_s to down_d. The idea is
that the construction of edges will force every path to
transition from up nodes to down nodes at most once,
thereby guaranteeing the path is valley-free. Let i and j
be two clusters observed as adjacent.


1. If i and j belong to the same AS, there is an
undirected edge between up_i and up_j and one between
down_i and down_j.


2. If i’s AS is a provider of j’s AS, there is a directed
edge from up_j to up_i and another directed edge from
down_i to down_j. These edges capture that a customer
will not provide transit between two providers.


3. If i and j belong to peer ASes, there is a directed
edge from up_i to down_j and from up_j to down_i.
These edges capture that i’s AS will use paths
through j only for itself and its customers (and
similarly for j’s AS and paths through i).


Finally, for each IP address i, there is a directed edge
from up_i to down_i. It is easy to verify that all routes in
the graph are valley-free by construction (after
transitioning from up to down, a transition from down to up
can no longer occur).
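
The edge rules above can be sketched as a small graph builder. The encoding of relationships as `rel[(a, b)]` and the function names are hypothetical, and the sketch adds the up-to-down turnaround edge per cluster rather than per IP address for brevity.

```python
from collections import defaultdict, deque

def build_up_down_graph(clusters, adjacencies, as_of, rel):
    """Build directed edges over ("up", i) and ("down", i) nodes.

    rel[(a, b)] is "provider" when AS a is a provider of AS b, or "peer"
    when a and b are peers (hypothetical encoding of inferred relationships).
    """
    edges = defaultdict(set)
    for i, j in adjacencies:
        ai, aj = as_of[i], as_of[j]
        if ai == aj:
            # Rule 1: same AS, undirected edges on both planes.
            for side in ("up", "down"):
                edges[(side, i)].add((side, j))
                edges[(side, j)].add((side, i))
        elif rel.get((ai, aj)) == "provider" or rel.get((aj, ai)) == "provider":
            # Rule 2: climb customer -> provider on up nodes,
            # descend provider -> customer on down nodes.
            p, c = (i, j) if rel.get((ai, aj)) == "provider" else (j, i)
            edges[("up", c)].add(("up", p))
            edges[("down", p)].add(("down", c))
        elif "peer" in (rel.get((ai, aj)), rel.get((aj, ai))):
            # Rule 3: a peering link is the single up -> down crossing.
            edges[("up", i)].add(("down", j))
            edges[("up", j)].add(("down", i))
    for i in clusters:
        # Turnaround edge from the up node to the down node.
        edges[("up", i)].add(("down", i))
    return edges

def reachable(edges, src, dst):
    """Breadth-first search over the directed graph."""
    seen, queue = {src}, deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            return True
        for nxt in edges.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False
```

With a customer cluster attached to two provider clusters, the customer can reach either provider and each provider can reach the customer, but no path leads from one provider to the other through the customer.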


4.2.4 Incorporating local preferences


Next, we incorporate local preferences in selecting AS 
paths. We assume that an AS prefers paths through its 
customers over those through its peers, which are in turn 
preferable to paths through provider ASes. To incor- 
porate these preferences, instead of calculating paths to 
the destination from all ASes and all routers in a batch, 
GRAPH computes routes in three phases. 

Figure 2 illustrates the phased approach. GRAPH first 
limits the graph to contain only the set of down nodes, 
along with the edges connecting them, and computes the 
optimal paths from these nodes to the destination. This 
frontier reaches precisely the routers in those ASes that 
get paid for providing transit to the destination. Once 
all such nodes have been visited and their best paths dis- 
covered, the algorithm is allowed to reach any additional 
nodes that can be reached only using peering; by con- 
struction, only one peering is traversed. Finally, the algo- 
rithm is allowed to use any link (e.g., provider links) to 
reach all remaining addresses. 


Results preview: As we show in detail in Section 6, 
GRAPH—despite taking into account many aspects of 
default routing behavior—correctly predicts only 30% 
of the AS paths for our measured dataset. In contrast, 
the path composition approach [30] (that dominates our 
achievable accuracy) achieves 70% accuracy using the 
entire set of observed routes. 

On the other hand, the storage overhead of GRAPH 
is directly proportional to the number of observed Inter- 
net links. As we will see in the evaluation section, this 
is two orders of magnitude more compact than the path 
composition approach. Thus, the challenge is to improve 
GRAPH’s accuracy while keeping it compact. 


4.3 Addressing sources of prediction error 


A careful examination of the above results reveals that 
GRAPH’s inaccuracies arise partly from our failure to 
model certain other aspects of Internet routing behav- 
ior and partly from errors in inferred AS relationships. 
GRAPH’s deficiencies are due to the following reasons. 


1. Asymmetry: A significant fraction of Internet routes 
are asymmetric [40, 21]. While GRAPH reflects 
some asymmetry, e.g., due to early exit routing, it 
does not fully capture asymmetric policy behavior. 


2. Inaccurate export policy: If GRAPH fails to identify 
a peer-to-peer relationship between two ASes, it is 
overly lenient in inferring export policy and predicts 
non-existent routes that would be filtered in practice. 


3. Incorrect local preferences: An AS’s customer may 
be a provider for specific paths. For example, two 
ASes may have different relationships in different
regions because one AS may have larger network
presence than the other in one region and vice-versa
in another region. Incorrect local preferences could 
result in an AS selecting a less preferable route, e.g., 
via a customer. 


4. Traffic engineering: ASes may engineer routes in 
order to improve routing for their customer traffic 
compared to transit traffic. 


We address each of these challenges by adding infor- 
mation in our data set back into the graph. 


4.3.1 Addressing asymmetry 


Due to the asymmetric nature of Internet routing, adding 
routes originating from the source to the atlas signifi- 
cantly improves the accuracy of predicted routes [29]. To 
reduce the likelihood of predicting non-existent routes, 
iNano splits the graph into two subgraphs: 1) TO_DST 
that consists of all directed links observed on the tracer- 
outes from iNano’s vantage points to all prefixes, and 
2) FROM_SRC that consists of all directed links on the 
traceroutes contributed to iNano by participating end- 
host sources. 

For each cluster, we introduce a directed edge from its 
corresponding node in FROM_SRC to its corresponding 
node in TO_DST. iNano then predicts the route using the 
Dijkstra-style algorithm that backtracks from the down 
node corresponding to the destination in TO_DST to the 
up node corresponding to the source in FROM_SRC. If it 
fails to find such a route, a likely scenario if the atlas lacks 
sufficient paths from the source prefix, then it attempts 
to find a path from the down node corresponding to the 
destination in TO_DST to the up node corresponding to 
the source in TO_DST. 


4.3.2 Inferring export policies 


GRAPH predicts non-existent routes that would be filtered 
given accurate AS relationships. Recall that we inferred 
the AS relationships automatically by analyzing observed 
behavior. Now, instead of explicitly distilling the AS re- 
lationships from the observed routes, we explore an al- 
ternate strategy that trades off a small amount of space 
for improved prediction accuracy. We seed iNano with 
known templates of export policy, e.g., if we observe a 
path that traverses the ASes Cogent, AT&T, and Sprint, 
we know that AT&T exports paths from Sprint to Cogent. 

To implement this strategy, the valley-free check in
GRAPH is replaced with the following 3-tuple check.
iNano explicitly stores the list of all 3-tuples
corresponding to three consecutive ASes observed in
traceroutes as well as BGP feeds (discounting prepending).
Ideally, we would consider a predicted route valid only if
all constituent segments of size three satisfy the 3-tuple
check by appearing in the list, meaning that the path was
exported at every intermediate AS. In Figure 3, we see
that, even though it is shorter, iNano cannot choose path
1-5-4 because the 3-tuple (1, 5, 4) does not appear in any
BGP advertisement or traceroute. iNano easily incorporates
the check in the backtracking step of the algorithm.
However, since visibility into ASes at the edge is limited,
we might fail to observe all of the export policies for
the edge ASes. iNano thus performs this check only for
3-segments in which the degree of the middle AS in the
Internet’s AS-level graph is greater than a threshold (5
in the current implementation). Finally, we assume
commutativity among triples, so that if we observe (AS1,
AS2, AS3), we include (AS3, AS2, AS1) as well.

Figure 3: Predicting the path from S to D. Thicker lines
show preferences, dashed lines show non-provider links,
and dark lines show the prediction. iNano cannot choose
1-5-4 because the 3-tuple does not appear and cannot
choose 1-7-4 because 7 is not a provider for 4. It
predicts 1-2-3-4 because of the preference for 2 over 5.
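
A minimal sketch of the 3-tuple check follows, assuming AS paths are given as lists of AS numbers and AS degrees as a dictionary. The function names are hypothetical; the threshold default of 5 and the reversed-triple rule come from the text.

```python
def collapse_prepending(path):
    """Drop consecutive repeats of the same AS (AS-path prepending)."""
    out = []
    for asn in path:
        if not out or out[-1] != asn:
            out.append(asn)
    return out

def build_tuple_set(as_paths):
    """Collect every observed triple of consecutive ASes, in both directions."""
    seen = set()
    for path in as_paths:
        p = collapse_prepending(path)
        for a, b, c in zip(p, p[1:], p[2:]):
            seen.add((a, b, c))
            seen.add((c, b, a))  # triples are treated as reversible
    return seen

def passes_tuple_check(path, seen, degree, threshold=5):
    """Reject a path if a triple around a well-connected middle AS was never observed."""
    for a, b, c in zip(path, path[1:], path[2:]):
        if degree.get(b, 0) > threshold and (a, b, c) not in seen:
            return False
    return True
```

Triples whose middle AS has a degree at or below the threshold are accepted without evidence, reflecting the limited visibility into edge ASes.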


4.3.3 Improving local preferences


Recall that we infer AS relationships and incorporate the 
customer<peer<provider preference order in the route 
prediction algorithm. Unfortunately, AS relation infer- 
ence by itself is difficult and error-prone. For example, 
AS relationship inference based on Gao’s algorithm [19] 
predicts that half of the edges observed between the top 
hundred ASes ranked by degree correspond to sibling re- 
lationships, which seems rather implausible. The 3-tuple 
check by itself is not sufficient; although it ensures that 
predicted routes consist only of observed tuples, it does 
not take AS preferences into account when multiple op- 
tions are available. 


iNano uses a relationship-agnostic method to infer AS 
preferences based only on observed routes. We infer 
these preferences using the entire set of observed paths, 
but include only the results of the inferences within the 
compressed link-level representation of the atlas. The 
technique works as follows. For each observed AS route r,
let r_1, ..., r_m be the set of alternative routes
available from the source, visible in the topology but not
taken. For each route r_i, if r and r_i share the first k
ASes but differ at the (k + 1)’th AS, then the k’th AS is
said to prefer the (k + 1)’th AS on r over the (k + 1)’th
AS on r_i. Each alternative route in the set r_1, ..., r_m
similarly yields a preference.

NSDI ’09: 6th USENIX Symposium on Networked Systems Design and Implementation


iNano stores the preferences obtained above as 3-tuples
(AS1, AS2 > AS3), where AS1 prefers a route through AS2
over a route through AS3 when both routes are of the same
length. In Figure 3, iNano selects the path 1-2-3-4 over
the path 1-5-3-4 because of a preference (1, 2 > 5). In
some cases, we observe both 3-tuples (AS1, AS2 > AS3) and
(AS1, AS3 > AS2). So, we include the preference (AS1,
AS2 > AS3) only if it was observed at least three times as
often as the preference (AS1, AS3 > AS2). If not, we ignore
both preferences; we conjecture that such wavering preferences
are likely due to load balancing by AS1. While some
AS preferences might be restricted to paths from specific 
source prefixes or to specific destination prefixes, iNano’s 
model of Internet routing currently captures only prefer- 
ences valid across sources and destinations. However, as 
we show in our evaluation, this suffices to significantly 
improve prediction accuracy. 
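
The preference inference could be sketched as below, assuming each observation supplies the AS route taken and a list of alternative routes visible in the topology. How alternatives are enumerated is left to the caller, and the function names are hypothetical; the three-to-one ratio comes from the text.

```python
from collections import Counter

def observe(counts, taken, alternatives):
    """Record (AS_k, chosen next AS, rejected next AS) for each alternative route."""
    for alt in alternatives:
        limit = min(len(taken), len(alt))
        k = 0
        while k < limit and taken[k] == alt[k]:
            k += 1
        if 0 < k < limit:  # routes share a non-empty prefix and then diverge
            counts[(taken[k - 1], taken[k], alt[k])] += 1

def stable_preferences(counts, ratio=3):
    """Keep (a, b > c) only if seen at least `ratio` times as often as (a, c > b)."""
    prefs = set()
    for (a, b, c), n in counts.items():
        if n >= ratio * counts.get((a, c, b), 0):
            prefs.add((a, b, c))
    return prefs
```

When neither direction dominates by the required ratio, both tuples fail the test and the wavering preference is dropped.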


4.3.4 Incorporating traffic engineering 


In many cases, we observe an edge from AS1 to AS2 on
some route in the atlas, but never see this edge on a route
terminating at AS2, i.e., when the destination is in AS2.
This occurs when an AS provides transit using one policy 
but routes to its own prefixes using a different policy, e.g., 
AS2 provides transit from AS1 to other ASes but does not 
send out BGP updates to AS1 for its own prefixes. The 
optimizations described above, the 3-tuple check and AS 
preferences, are insufficient to handle such cases. 

To address the problem, iNano explicitly maintains
information about provider ASes. For each AS, we determine
its upstream neighbor ASes, i.e., the set of ASes observed
immediately prior to this AS in the atlas. We also
determine the set of providers for each AS, i.e., the set
of ASes observed upstream of this AS when it is the origin.
For the latter, we use both our traceroute data as well as
BGP snapshots [33, 47]. For 1,352 ASes out of a total of
27,515 ASes in the atlas, we find the set of providers to
be a proper subset of the set of upstream neighbors. In
such cases, the previous algorithms could give the wrong
path. We refine the approach further to determine the
provider and upstream neighbor sets on a per-prefix basis.
In Figure 3, iNano cannot select the path 1-7-4, even
though it is shorter, because 7 is not a provider for 4.
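
The two sets can be derived from observed AS paths as in the sketch below, which works per AS rather than per prefix for brevity. Paths are assumed to be lists of AS numbers ordered from source to origin; the function names are hypothetical.

```python
from collections import defaultdict

def upstream_and_providers(as_paths):
    """Return (upstream, providers) maps built from source-to-origin AS paths."""
    upstream = defaultdict(set)   # ASes seen immediately before an AS anywhere
    providers = defaultdict(set)  # ASes seen immediately before an AS when it is the origin
    for path in as_paths:
        for prev, cur in zip(path, path[1:]):
            upstream[cur].add(prev)
        if len(path) >= 2:
            providers[path[-1]].add(path[-2])
    return upstream, providers

def allowed_last_hop(prev_as, origin_as, providers):
    """Allow the final AS edge only if prev_as was seen delivering to the origin."""
    known = providers.get(origin_as)
    return prev_as in known if known else True
```

In the Figure 3 scenario, AS 7 appears upstream of AS 4 only on transit routes, so a route that ends with the edge from 7 to 4 is rejected.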


5 Implementation of iNano 


Our implementation of iNano can roughly be divided
into two logical components: server-side and client-side.
The primary function of the server-side implementation
is to gather measurements and to build the link-based
atlas as described in the previous section. In addition,
the iNano server bootstraps the distribution of the atlas
to end-hosts.

The client-side implementation comprises a library 
providing information about Internet paths. The library 
performs four functions—fetching the atlas, augmenting 
the atlas with local measurements, servicing queries for 
path information from applications, and keeping the atlas 
up-to-date. 

Fetching the Atlas: On startup, the iNano library 
fetches the atlas required for making predictions. The 
atlas fetched includes the following datasets: the set 
of inter-cluster links annotated with latencies and loss 
rates, data to map IP addresses to prefixes and ASes, 
AS degrees, AS 3-tuples, AS preferences, and the set of 
providers for each AS. Having all end-hosts fetch the at- 
las from iNano’s server would require an extremely large 
amount of bandwidth to be provisioned at the server. This 
would significantly drive up the cost required to run and 
maintain iNano. 

Therefore, we instead rely on swarming the atlas 
across clients in order to distribute it. iNano’s central 
server serves as the seed for the dissemination of the at- 
las. In addition, every end-host running the iNano library 
makes available the portion of the atlas it has downloaded 
for other end-hosts to download. We have made our im- 
plementation sufficiently modular that any peer-to-peer 
filesharing protocol can be plugged in for distribution of 
the atlas. Our current implementation uses CoBlitz [39] 
and we are working on a version that uses BitTorrent [11]. 

Client-side Measurements: As previously explained 
in Section 4.3.1, iNano explicitly incorporates path asym- 
metry into its prediction model to improve the accuracy 
of path prediction. To enable this, iNano’s library in- 
cludes a measurement toolkit used to gather measure- 
ments of the Internet from the perspective of end-hosts. 
The library uses this toolkit to issue traceroutes daily to 
destinations in a few hundred prefixes, chosen at random 
from all the routable prefixes in the Internet. The new 
links discovered as part of these traceroutes are added to 
the FROM_SRC plane of the atlas. The library also up- 
loads the measured traceroutes to the central server. The 
server incorporates these measurements into the atlas dis- 
tributed out to all end-hosts. Buggy or malicious clients 
could distort the atlas by contributing incorrect or fab- 
ricated measurements. While such discrepancies could 
be inferred by comparing with measurements from other 
clients, we leave such inference to future work. 

Serving Queries: Once the atlas is fetched and augmented
with client-side measurements, the library starts up a
local query server. This query server implements the
prediction algorithm developed in Section 4. The API
exported by the library enables applications to query for
information on paths between (src, dst) IP address pairs
in batches of arbitrary sizes. In future work, we plan to
support remote queries so that only one local host need
download the atlas.

Table 2: Current size of iNano’s atlas, in terms of number
of entries, compressed bytes on disk, and the delta
between consecutive days.

Keeping Atlas Up-to-date: Paths and path proper- 
ties on the Internet change over time. Hence, iNano’s 
atlas needs to be kept up-to-date to reflect current net- 
work conditions. Fortunately, the stationarity of Internet 
routing keeps the bandwidth cost of such updates low. A 
significant fraction of Internet routes are stationary [40] 
across days and path properties are stationary [60, 30] on 
the timescale of several hours. Therefore, as we show
later in our evaluation, the difference between the atlases
of consecutive days can typically be represented in
approximately 1MB. As a result, once an end-host fetches
the complete atlas, it can maintain an up-to-date atlas
thereafter by downloading a daily 1MB update, also as
a swarmed file download.


6 Evaluation 


In this section, we evaluate the accuracy of iNano’s pre- 
dictions of paths and path properties, and study the con- 
tribution that each of iNano’s components makes towards 
its predictive ability. We also quantify the stationarity of 
iNano’s atlas across days, iNano’s storage requirements, 
and how the atlas size would grow with additional van- 
tage points. 


6.1 Size of the atlas 


First, we discuss the typical size of iNano’s atlas and then 
evaluate how this size would scale with measurements 
from more vantage points. 


6.1.1 What is the current size of the atlas? 


We describe a typical day’s atlas that we use for most 
of the evaluation in this section. We leverage PlanetLab 
nodes as vantage points for gathering the iNano atlas. The 
atlas we use in our evaluation comprises traceroutes from
197 PlanetLab nodes to one destination each in 140K
prefixes. All of these traceroutes were gathered over the
course of a day. After alias resolution and clustering, 85K
distinct clusters are present in the atlas, with 309K links
between them. The dataset obtained by combining these
inter-cluster links annotated with latencies and loss rates,
observed AS 3-tuples, inferred AS preferences, and the
mapping of ASes to their providers is roughly 6.6MB in
size. AS 3-tuples, the dataset with the largest number of
entries, are highly amenable to compression because only
2500 ASes, less than 10% of all the ASes in the atlas,
occur as the middle component of any 3-tuple. Table 2
shows the size associated with each of these components
of the atlas.


6.1.2 Does iNano’s atlas scale w.r.t vantage points? 


iNano uses measurements from end-hosts to improve pre- 
diction accuracy for asymmetric routes. However, adding 
more measurements could significantly inflate the size of 
iNano’s atlas, questioning the basic tenet of our work—is 
the atlas still tractable if it includes measurements from 
millions of end-hosts? 

To study this question, we use the DIMES measure- 
ment infrastructure [50]. The DIMES project runs an In- 
ternet measurement agent on a few thousand end-hosts 
distributed worldwide. We issued traceroutes from 845 
DIMES agents to 100 randomly chosen destinations each 
over the course of a week. 

The addition of measurements from more vantage 
points primarily impacts the number of inter-cluster links 
and the number of AS three-tuples in the atlas. As stated 
previously, measurements from PlanetLab find approxi- 
mately 309K links and 1.05M AS three-tuples. Including 
the measurements from the 845 DIMES agents into the 
atlas added approximately 16K links and 14K AS three- 
tuples in total. Even though the addition of links from 
more vantage points is likely to be sublinear in practice, 
we extrapolate linearly to get a conservative estimate of 
the increase in the size of the atlas if we had measure- 
ments from all of the Internet’s edge. Including tracer- 
outes from end-hosts in all 100K prefixes at the Internet’s 
edge would increase the number of links in the atlas from 
309K to approximately 2.2M (16K new links added for 
every 845 hosts), an eight-fold increase, and the num- 
ber of AS three-tuples from 1.05M to 2.7M (14K new 
three-tuples for every 845 hosts), a three-fold increase. 
Assuming this data is as compressible as the PlanetLab 
data, this would add 18MB to the atlas and 5MB to the 
daily update. It is future work to determine how much of 
this data is truly needed, discarding information that adds 
little in terms of added accuracy. 
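
The linear extrapolation above can be reproduced with a few lines of arithmetic. The sketch assumes one contributing end-host per edge prefix, so 100K prefixes correspond to 100K hosts; the function name is hypothetical.

```python
def extrapolate(base, added, sample_hosts, target_hosts):
    """Scale the entries added by sample_hosts linearly up to target_hosts."""
    return base + added * target_hosts / sample_hosts

# Counts from the text: 845 DIMES agents added about 16K links and 14K three-tuples.
links = extrapolate(309_000, 16_000, 845, 100_000)
three_tuples = extrapolate(1_050_000, 14_000, 845, 100_000)

print(f"links: {links / 1e6:.1f}M, three-tuples: {three_tuples / 1e6:.1f}M")
```

Since the number of newly discovered links is expected to grow sublinearly with additional vantage points, these figures are upper-bound estimates.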


6.2 Stationarity of measurements 


iNano refreshes its atlas once every day. To evaluate
whether the interval of a day between updates suffices, we
examine the stationarity of the two kinds of measurements,
traceroutes and loss rate measurements, used to construct
iNano’s atlas. Our link latencies do not capture
transmission and queueing delays, and hence, are extremely
stable. We then present the size of the difference between
successive atlases that arises as a result of the
stationarity in measurements.

Figure 4: Similarity of PoP-level paths across consecutive
days for routes measured from 195 PlanetLab nodes to
destinations in 140K prefixes.


6.2.1 How stationary are routes? 


We studied the stationarity of routing by comparing the 
traceroutes measured from each of 195 PlanetLab nodes 
to destinations in 140K prefixes on successive days. 
Since iNano only considers the Internet topology at the 
granularity of clusters corresponding to PoPs, we map 
traceroutes to cluster-level paths for comparison. We 
compared every path between a PlanetLab node and a 
destination on one day with the same path the next day 
using the path similarity metric [22, 29]. The similarity 
metric compares two paths as the ratio of the size of the 
intersection to the size of the union, of the sets of clusters 
in each of the paths; the ordering of clusters in the paths 
is not considered. The maximum value of this metric is 
1 when both paths pass through exactly the same set of 
clusters, and the minimum value is 0 when the paths are 
completely disjoint. Figure 4 shows the distribution of 
PoP-level path similarity we obtained by comparing paths 
across consecutive days, grouping the similarity values 
into bins of 0.05. 91% of the paths on the first day have 
a similarity of at least 0.75 with the corresponding paths 
measured the next day, 687% have a similarity of at least 
0.9, and 50% remain identical. 
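
The similarity metric described above is the ratio of intersection size to union size over the sets of clusters on the two paths. A minimal sketch, with a hypothetical function name and the added convention that two empty paths count as identical:

```python
def path_similarity(path_a, path_b):
    """Ratio of shared clusters to all clusters on either path; order is ignored."""
    a, b = set(path_a), set(path_b)
    if not a and not b:
        return 1.0  # convention chosen for this sketch
    return len(a & b) / len(a | b)
```

For instance, two four-cluster paths that differ in a single cluster share three of five distinct clusters and score 0.6.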

The main prior work on studying path stationarity has 
been by Paxson [40] and Zhang et al. [60]. Both observed 
more stationarity in routes than we do—Paxson found 
68% of paths to be identical across days at the granu- 
larity of routers, and Zhang et al. found the same number 
to be more than 75%. We believe the difference in our
findings is due to our significantly larger dataset.
Paxson’s measurement dataset included traceroutes between
27 vantage points and Zhang et al. used traceroutes
between 220 vantage points. In contrast, our analysis of
path stationarity uses traceroutes from 195 vantage points
to 140K destinations each.


6.2.2 How stationary are loss rates? 


To evaluate the stationarity of packet loss, we probed 
paths from 201 PlanetLab nodes to destinations in 5000 
randomly chosen prefixes each. We sent out 100 ICMP 
probes of size 1KB on each path, with successive probes 
separated by 2 seconds, and determined the fraction of 
probes for which we received no response. We repeated 
these loss measurements 6 hours later. We found that 
66% of paths on which we originally observed packet 
loss continued to be lossy 6 hours later. We also repeated 
these measurements 12 hours and 24 hours after the orig- 
inal measurements. The fraction of lossy paths that con- 
tinued to remain so decreased from 66% to 53% when 
the interval between measurements was increased from 6 
hours to 12 hours but remained steady at 53% when the 
interval was increased further to 24 hours. 


6.2.3 How stationary is iNano’s atlas?


As a result of the significant stationarity seen in both 
paths and path properties over the interval of a day, the 
difference between iNano’s atlases on consecutive days 
is much smaller in size than the atlas itself. To update 
the atlas from the previous day, iNano ships the union 
of the old entries not present any more and new entries 
added to the inter-cluster links, link loss rates, and ob- 
served AS three-tuples datasets. The size of the link loss 
rates delta is larger than the loss rates dataset itself be- 
cause we have to update a link’s loss rate not just when 
it changes from being lossless to lossy (or vice-versa), as 
in our study on stationarity of loss above, but also when 
the link’s loss rate changes. All the other datasets do not 
change on a day-to-day basis and hence, are updated in 
full only once a month. Table 2 shows that the typical 
difference is 1.34MB in size, less than one-fifth the typ- 
ical size of a complete atlas. This implies that once an 
end-host downloads iNano’s atlas, it can keep its local in- 
formation up-to-date by fetching a significantly smaller 
update daily thereafter. 
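
A daily update of this kind can be sketched as a pair of set differences. The representation below (each dataset as a set of hashable entries) and the function names are assumptions made for illustration, not iNano's actual on-disk format.

```python
def atlas_delta(old, new):
    """Return (removed, added): entries to drop from and add to yesterday's atlas."""
    return old - new, new - old


def apply_delta(old, removed, added):
    """Reconstruct today's atlas from yesterday's atlas and the shipped delta."""
    return (old - removed) | added


# Hypothetical inter-cluster links, written as (from_cluster, to_cluster).
yesterday = {("c1", "c2"), ("c2", "c3"), ("c3", "c4")}
today = {("c1", "c2"), ("c3", "c4"), ("c4", "c5")}
removed, added = atlas_delta(yesterday, today)
rebuilt = apply_delta(yesterday, removed, added)
```

If loss-rate entries are stored as (link, loss rate) pairs, any change in a link's loss rate removes one entry and adds another, which is consistent with the loss-rate delta being comparatively large.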


6.3 Accuracy of Predictions 


We next evaluate the accuracy of iNano’s predictions 
of both paths and path properties. From the 197 van- 
tage points used in gathering the atlas described in Sec- 
tion 6.1.1, we choose a subset of 37 at random as our rep- 
resentative end-hosts. We pick 100 random traceroutes 
performed from each of them. After discarding paths that 
do not reach the destination or have AS-level loops, we 
are left with a validation set of 2816 paths. To predict the 
paths and path properties from one of the 37 sources, we 
include links from all traceroutes from the remaining 196 
vantage points in the TO_DST plane and links from 100
other randomly chosen traceroutes from this source in the
FROM_SRC plane.

Figure 5: AS path prediction accuracy for measured traces as
components are incorporated into iNano. RouteScope is the
algorithm from [32], GRAPH is the algorithm described in
Section 4.2, and path-based is the iPlane algorithm. Improved
path-based incorporates iNano’s techniques into the iPlane
algorithm.


6.3.1 Can iNano predict AS paths accurately?


We evaluate the accuracy of iNano’s ability to predict the 
AS paths in our validation set. We evaluate the accuracy 
of iNano’s path prediction only at AS-level and not at 
PoP-level because our dataset clustering router interfaces
into PoPs is incomplete. As a result, when our clustering
indicates that two PoP-level paths are not identical, it is 
hard to say whether the difference is because of the in- 
completeness of our clustering data or they are indeed 
different. In contrast, our mapping from IPs to ASes is 
significantly more comprehensive. 


Figure 5 shows the improvement in accuracy of AS 
path prediction as each component of iNano is incorpo- 
rated into the GRAPH algorithm. The fraction of paths for 
which we predict the AS path exactly right increases from 
31% with GRAPH to 70% with all components of iNano 
included. Each of the four techniques that iNano uses 
significantly improves iNano’s ability to predict paths. In 
fact, our final predictive model achieves the same AS path 
accuracy as iPlane’s path composition technique, which 
uses a path-based dataset two orders of magnitude larger 
than iNano’s link-based atlas. Furthermore, iNano out- 
does path composition in the ability to predict AS path 
length. 


Figure 5 also compares iNano’s AS path prediction ac- 
curacy with that of RouteScope [32], the only prior work 
that predicts AS paths from a graph representation of In- 
ternet topology. First, RouteScope computes relation- 
ships between ASes using an observed set of AS paths 
as input. However, to predict the path between a (src, 
dst) pair, it needs only the AS-level graph of the Internet. 
RouteScope computes the set of shortest AS paths determined
to be valley-free between the AS of src and the AS





of dst. For the problem setting targeted by iNano, a single 
predicted path is required to estimate end-to-end perfor- 
mance. Therefore, to evaluate the utility of RouteScope 
in this setting, we choose one path at random from the 
set of paths returned by RouteScope for each (src, dst) 
pair. RouteScope’s accuracy at predicting AS path length 
is only as good as that of GRAPH, and its accuracy at 
predicting the correct AS path is worse than GRAPH’s. 
iNano’s significantly better accuracy stems from its mod- 
eling of Internet routing at PoP-level instead of AS-level 
and its modeling of routing with techniques beyond sim- 
ple valley-free routing. 

iNano’s techniques are also applicable to a structural 
approach that works by composing path segments. We 
incorporate these techniques into iPlane’s path composi- 
tion algorithm to improve the accuracy of prediction us- 
ing an atlas of paths. When two path segments are being 
spliced together, we check whether the sequence of ASes 
prior to, at, and after the point of intersection exists in 
our database of 3-tuples. We also ensure that AS prefer- 
ences are enforced when multiple candidate intersections 
pass the 3-tuple check. Figure 5 shows that the modified 
path composition technique increases iPlane’s ability to 
predict AS paths from 70% to 81%. 
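
The 3-tuple check applied when splicing can be sketched as follows. This is a simplified illustration that works on AS-level segments only; the function names are hypothetical, and the enforcement of AS preferences among multiple passing candidates is not modeled.

```python
def as_triples(as_path):
    """All consecutive AS 3-tuples observed along one AS path."""
    return {tuple(as_path[i:i + 3]) for i in range(len(as_path) - 2)}


def splice_allowed(prefix, suffix, known_triples):
    """Check whether two segments may be joined at their shared AS.

    `prefix` must end at, and `suffix` must start with, the AS where the
    segments intersect. The splice passes if the (before, at, after)
    3-tuple around the intersection was observed in the atlas.
    """
    if not prefix or not suffix or prefix[-1] != suffix[0]:
        return False
    if len(prefix) < 2 or len(suffix) < 2:
        return True  # no AS on one side of the intersection, nothing to check
    return (prefix[-2], prefix[-1], suffix[1]) in known_triples


# Hypothetical observed AS paths (AS numbers are placeholders).
known = as_triples([1, 2, 3, 4]) | as_triples([5, 2, 6])
ok = splice_allowed([1, 2], [2, 3, 4], known)
rejected = splice_allowed([5, 2], [2, 3, 4], known)
```

The second splice is rejected because no observed path shows AS 2 carrying traffic from AS 5 on to AS 3, even though both segments exist individually.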

The ability to predict paths using either iNano or path 
composition is limited by two factors, the comprehen- 
siveness of the atlas measured from our vantage points 
and the accuracy of our inferred routing policies. We 
quantified the contribution of the former to the inaccu- 
racy in path predictions as follows. For each path in our 
validation set, we determined whether all the inter-cluster 
links on the path were present in the corresponding atlas 
used to predict the path. 7% of paths were such that at 
least one of the inter-cluster links along the path was not 
observed in the atlas used for prediction. Therefore, if we 
had better coverage of the Internet’s topology with mea- 
surements from more vantage points, the accuracy of path 
prediction could increase to up to 77% using iNano and 
to up to 88% using path composition. 
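
The upper bounds quoted above follow from adding the fraction of validation paths with at least one unobserved inter-cluster link to the measured accuracy. This rests on the optimistic assumption (stated here for illustration) that every such path is currently mispredicted and would be predicted correctly with full coverage:

```latex
\begin{align*}
\text{iNano:} \quad & 70\% + 7\% = 77\% \\
\text{path composition:} \quad & 81\% + 7\% = 88\%
\end{align*}
```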


6.3.2 How accurately can iNano estimate path properties?


Next, we evaluate iNano’s ability to estimate latencies 
along paths to arbitrary end-hosts. For each of the paths 
used in our evaluation of path prediction accuracy, we 
compose iNano’s link latency estimates along the pre- 
dicted forward and reverse paths to derive an estimate 
for the end-to-end latency. Figure 6 shows the error in 
iNano’s latency estimates. We derive latency estimates 
for the same paths using the path-composition technique 
of iPlane [30] and using Vivaldi [13], a popular network 
coordinate system. iNano’s median latency estimation error
is 11ms, as compared to a median error of 20ms with
Vivaldi. The path composition technique yields an even
lower median error of 6ms, partly because of its better
accuracy at predicting paths and partly because estimates
of latencies along path segments tend to be more accurate
than the sum of individual links.

Figure 6: Accuracy of latency estimates along paths to arbitrary
destinations.

Figure 7: Accuracy of techniques in predicting 10 closest
destinations (in terms of delay).


However, the order of the three lines is reversed in the 
tail. iNano yields better latency estimates than the path 
composition technique in the tail because of differences 
in the methodology used to obtain link latencies for the 
former and path segment latencies for the latter. Our tech- 
niques for inferring link latencies identify and use mea- 
surements obtained by symmetric traversal of links [28], 
whereas our latency estimates of path segments do not. 
Like in iPlane [30], our latency estimates for path seg- 
ments are obtained by just subtracting RTTs measured in 
traceroutes. The fact that Vivaldi produces better latency 
estimates in the tail than both iNano and path composi- 
tion shows the significant room for improvement in our 
latency estimates for both links and path segments. 


Applications such as peer selection and detour routing 
benefit from the ability to discern which destinations have 
low latency from a source. We therefore also assess la- 
tency estimation from the perspective of ranking different 
destinations in terms of latency from a common source. 
To quantify each technique’s predictive ability on this cri- 
terion, we use the following metric. From each source, 
we determine the 10 closest nodes in terms of actual measured
RTT among the 100 destinations per source in our
validation set. We then do the same using estimated latencies
and compute the intersection between the actual
and predicted sets of 10 closest nodes. Figure 7 plots the
cardinality of this intersection for each source in our validation
set. iNano’s ability to rank paths is significantly
better than that of Vivaldi, while being comparable to the
path-based approach.

Figure 8: Accuracy of loss rate estimates along paths to arbitrary
destinations.
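
The ranking metric described above (size of the intersection between the actual and predicted sets of closest destinations) can be sketched as follows. Function names and the toy latencies are hypothetical; the example uses k = 2 rather than 10 to stay small.

```python
def closest_k(latency_by_dst, k=10):
    """Set of the k destinations with the lowest latency from one source."""
    return set(sorted(latency_by_dst, key=latency_by_dst.get)[:k])


def topk_overlap(measured, estimated, k=10):
    """Number of destinations common to the measured and estimated top-k sets."""
    return len(closest_k(measured, k) & closest_k(estimated, k))


# Toy latencies in ms from a single source, for illustration only.
measured = {"a": 10, "b": 20, "c": 30, "d": 40}
estimated = {"a": 12, "b": 35, "c": 18, "d": 50}
overlap = topk_overlap(measured, estimated, k=2)
```

A perfect estimator scores k for every source, so the metric rewards getting the ordering right near the top of the list rather than minimizing absolute latency error.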

We next consider how well iNano can predict loss 
rates. We measured the loss rates along each of our vali- 
dation paths and also measured the loss rate of each inter- 
cluster link in our atlas. We then use iNano to estimate 
the loss rate by composing the loss rates of the links along 
the predicted forward and reverse paths. Figure 8 plots 
the accuracy of iNano’s loss rate estimates. Since coordi- 
nate systems, such as Vivaldi, can only estimate latency, 
we restrict our comparison to iPlane’s path composition 
technique in the case of loss rate. iNano approximates 
path-based estimates with a much smaller atlas. 
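
Composing per-link estimates into an end-to-end estimate can be sketched as below. The text does not spell out the composition rule for loss; this sketch assumes links drop packets independently, so the path delivers a packet only if every link does. Function names and values are hypothetical.

```python
import math


def compose_latency(link_latencies_ms):
    """End-to-end latency as the sum of per-link latency estimates."""
    return sum(link_latencies_ms)


def compose_loss(link_loss_rates):
    """End-to-end loss rate, assuming links drop packets independently."""
    return 1.0 - math.prod(1.0 - rate for rate in link_loss_rates)


# Hypothetical per-link estimates along a predicted forward and reverse path.
forward_loss = [0.0, 0.01, 0.0]
reverse_loss = [0.02, 0.0]
round_trip_loss = compose_loss(forward_loss + reverse_loss)
round_trip_latency = compose_latency([5.0, 20.0, 3.0] + [4.0, 22.0])
```

Latency composes additively while loss does not, which is why techniques built around additive distances do not carry over directly to loss rate.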


7 Applications 


Our motivation in building iNano is to provide informa- 
tion on Internet paths to peer-to-peer applications. There- 
fore, we investigate the utility of the iNano library by 
using it in three sample peer-to-peer applications—peer- 
to-peer file transfer, voice-over-IP, and detour routing 
around failures. 


7.1 P2P file transfer 


The next generation of content distribution networks
(CDNs) is moving away from server-based deployments
to client-based models. In contrast to services like Aka- 
mai [6], several alternatives [45, 38, 25] have recently 
emerged that perform content delivery by utilizing client 
end-hosts for storage and bandwidth. In such client-based 
CDNs, which are not centrally managed, a common prob- 
lem is to determine the best replica for a given client. 
iNano enables clients to make this decision locally. 

To evaluate the utility of iNano in client-based content-delivery
systems, we emulated such a system as follows.
We considered 199 PlanetLab nodes as clients. We resolved
an Akamai-zed DNS name from these nodes to
discover 199 Akamai servers. For each client, we then
determined the set of replicas that host the content of its
interest by choosing 5 Akamai servers at random†, independently
for every client. We then determined the best
replica for every client using four different sources of
path information: 1) measured latencies, 2) latency estimates
from Vivaldi [13], 3) latency estimates from OASIS [18],
a server-selection system used by many CDNs
deployed on PlanetLab, and 4) latency and loss rate estimates
from iNano. We also consider the strategy of
choosing replicas at random. We evaluated each strategy
by downloading from every client a file from each
replica. We compare the download times for each strategy
with the optimal, which is the minimum of the download
times from the 5 replicas associated with the client.
Figure 9 shows the results of this experiment. First, we
downloaded a 30KB file wherein we only used iNano’s
estimates of path latency, because short TCP transfers
are dominated by latency [8]. iNano closely tracks the
performance obtained with measured latencies and is significantly
better than the performance obtained with the
use of Vivaldi or OASIS. We then repeated this experiment
for a 1.5MB file. In this setting, we use iNano’s
latency and loss rate estimates in combination to choose
the replica that would maximize TCP throughput based
on the PFTK model [37]. iNano’s predictions of loss rates
enable it to choose replicas that deliver significantly better
download performance than that obtained using measured
latencies. Vivaldi and OASIS, restricted to modeling
path latency, continue to yield poorer performance.

† We used such a setup instead of using PlanetLab nodes as replicas
because the locations of PlanetLab nodes are hard-coded into OASIS.

Figure 9: Evaluation of peer selection in a peer-to-peer file
transfer system for file sizes of (a) 30KB and (b) 1.5MB. Each
point is a median of 10 samples, with each sample obtained with
a different randomly selected set of replicas.

Figure 10: Evaluation of relay selection for voice-over-IP using
iNano’s estimates of latency and loss rate.
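
Choosing a replica by predicted TCP throughput can be sketched with the commonly cited approximate form of the PFTK model. The parameter defaults below (segment size, retransmission timeout of four RTTs, b = 2 packets per ACK) and all function names are assumptions for illustration; the text does not state which values were used in the experiment.

```python
import math


def pftk_throughput(rtt, loss, mss=1460, t0=None, b=2, wmax=None):
    """Approximate steady-state TCP throughput in bytes/second (PFTK model).

    rtt and t0 are in seconds, loss is a probability, wmax is the maximum
    window in packets. All defaults are illustrative assumptions.
    """
    if t0 is None:
        t0 = 4 * rtt  # assumed retransmission timeout
    if loss <= 0:
        # The formula is undefined at zero loss; only the window limits the rate.
        return float("inf") if wmax is None else wmax * mss / rtt
    denom = (rtt * math.sqrt(2 * b * loss / 3)
             + t0 * min(1.0, 3 * math.sqrt(3 * b * loss / 8))
             * loss * (1 + 32 * loss ** 2))
    rate = mss / denom
    if wmax is not None:
        rate = min(rate, wmax * mss / rtt)
    return rate


def best_replica(estimates):
    """Pick the replica with the highest predicted throughput.

    `estimates` maps replica name -> (rtt_seconds, loss_rate).
    """
    return max(estimates, key=lambda r: pftk_throughput(*estimates[r]))


# Hypothetical estimates: the nearer replica sits behind a lossy path.
replicas = {"near_lossy": (0.020, 0.05), "far_clean": (0.060, 0.001)}
choice = best_replica(replicas)
```

In this toy case a latency-only ranking would pick the nearer replica, whereas the throughput model prefers the farther one because of its much lower loss rate.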

Unlike our experimental evaluation, in practice, a P2P 
CDN may perform a transfer in parallel across multiple 
paths assuming that at least one of those paths will pro- 
vide good performance. iNano can be of benefit to such 
applications in two ways. First, in applications that trans- 
mit video, iNano can reduce the bootstrapping time for 
the video to load by helping prune down a potentially 
large set of path alternatives to a small set of good paths 
used for the transfer, without performing any measure- 
ments. Second, by enabling the application to focus in 
on the good paths quickly, iNano reduces the redundant 
traffic sent by the application that either gets dropped on 
lossy paths or is used just for measurement. 


7.2 Voice-over-IP


Voice-over-IP (VoIP) has emerged as a popular peer-to- 
peer application in recent years. VoIP applications such 
as Skype [52] allow end-hosts that are both behind NATs 
to talk to each other by routing packets via another end- 
host that serves as a relay. Picking the right relay is vital 
to ensure reasonable quality of the end-to-end call [46]. 
We emulated a VoIP application by considering 119 
PlanetLab nodes as representative end-hosts. We chose 
1200 (source, destination) pairs at random and emulated 
a VoIP call between each such pair by sending a 10KBps
constant bitrate UDP packet stream from the source to 
the destination. For each call, we consider all end-hosts 
other than the source and destination to be potential re- 
lays. We use iNano to pick the 10 relays that minimize 
the predicted loss rate and then choose the one amongst 
these that minimizes end-to-end latency. We compare this
strategy of choosing relays with three other strategies:
1) closest to source based on measured latency, 2) closest
to destination based on measured latency, and 3) random.

Figure 11: Ability to route around failures using iNano’s path
predictions and using detour nodes at random. Note that the
y-axis is on a log scale to base 2.

Figure 10 compares the quality of the relay nodes cho- 
sen by using iNano’s estimates of latency and loss rate 
with the choices made using the other strategies. Paths 
via relay nodes chosen by iNano see significantly less 
packet loss compared to the alternatives. 
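
The two-stage relay choice used in this experiment (shortlist by predicted loss, then pick by latency) can be sketched as follows. The function name and the candidate values are hypothetical, and the example shortlists 2 relays instead of 10 to stay small.

```python
def pick_relay(candidates, shortlist_size=10):
    """Two-stage relay selection.

    `candidates` maps relay -> (predicted_loss, predicted_latency_ms) for the
    source -> relay -> destination path. First keep the relays with the
    lowest predicted loss, then return the lowest-latency relay among them.
    """
    low_loss = sorted(candidates, key=lambda r: candidates[r][0])[:shortlist_size]
    return min(low_loss, key=lambda r: candidates[r][1])


# Hypothetical predictions for three candidate relays.
candidates = {
    "r1": (0.00, 180.0),
    "r2": (0.01, 90.0),
    "r3": (0.20, 40.0),
}
relay = pick_relay(candidates, shortlist_size=2)
```

Relay "r3" has the lowest latency but is excluded by the loss shortlist, so the call is routed via "r2".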


7.3 Detouring around failures


Several Internet measurement studies [40, 60, 15, 20] 
have shown that the typical availability of an Internet 
path is “two-nines”, i.e., 99%. This level of availability 
falls well short of that measured for the telephone net- 
work [26]. One of the solutions proposed to mitigate 
this problem is detour routing [48]. When a source is 
unable to reach a destination, the source can attempt to 
contact the destination instead by routing its packets via 
another end-host that serves as a detour. Previous solu- 
tions for improving availability with detour routing im- 
plement one of three approaches—1) constantly moni- 
tor paths between all pairs of end-hosts [7], 2) constantly 
monitor paths between all pairs of detour nodes [1] and 
have end-hosts route through nearby detour nodes, or 3) 
detour via a small randomly chosen set of end-hosts [20]. 
All-pairs monitoring is infeasible at Internet-scale, mon- 
itoring paths only between detour nodes ignores failures 
on paths from end-hosts to nearby detour nodes, and a 
small randomly chosen set of detours will not suffice for 
widespread outages. 

We explore a new way of routing around failures by 
choosing detour nodes that maximize the disjointness be- 
tween the detour path and the direct path. When a source 
is unable to reach a destination, we use iNano to predict 
the direct path from the source to the destination as well 
as the detour path via each of the available intermedi- 
aries. We then rank the detour paths based on the number 
of PoPs and ASes shared by their predicted paths. We 
choose the (k + 1)th detour node in this ranking to be the
one that minimizes first the number of PoPs and second
the number of ASes in common with the direct path and
the k previously chosen detours. A strategy for recovering
from failures by using N detours would try the first
N detours in the ranking; the lower the value of N, the
less overhead incurred.
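
The greedy ranking described above can be sketched as follows. We assume here that "in common with the direct path and the k previously chosen detours" means overlap with the union of their PoPs and ASes; the data layout, names, and the tie-break by detour name are likewise illustrative choices.

```python
def rank_detours(direct, detours):
    """Greedily rank detours by disjointness from paths already in use.

    `direct` is (set_of_pops, set_of_ases) for the predicted direct path.
    `detours` maps detour name -> (set_of_pops, set_of_ases) for the
    predicted path via that detour. Each step picks the detour sharing the
    fewest PoPs, then the fewest ASes, with everything chosen so far.
    """
    covered_pops, covered_ases = set(direct[0]), set(direct[1])
    remaining = dict(detours)
    ranking = []
    while remaining:
        best = min(
            remaining,
            key=lambda d: (len(remaining[d][0] & covered_pops),
                           len(remaining[d][1] & covered_ases),
                           d),  # name as a deterministic tie-break
        )
        ranking.append(best)
        pops, ases = remaining.pop(best)
        covered_pops |= pops
        covered_ases |= ases
    return ranking


# Hypothetical predicted paths, reduced to the PoPs and ASes they traverse.
direct = ({"p1", "p2", "p3"}, {"A", "B"})
detours = {
    "x": ({"p1", "p9"}, {"A", "C"}),
    "y": ({"p5", "p8"}, {"B", "D"}),
    "z": ({"p7", "p6"}, {"E"}),
}
order = rank_detours(direct, detours)
```

Trying the first N entries of `order` corresponds to the N-detour recovery strategy.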

To compare the efficacy of the above strategy for rout- 
ing around failures with SOSR’s [20] strategy of using a 
few detours at random, we gathered the following mea- 
surements of path availability. We used 35 PlanetLab 
nodes and performed traceroutes continually for a week 
from each of them to destinations in 1000 randomly cho- 
sen prefixes, once every 15 minutes. Whenever a Planet- 
Lab node was unable to reach a destination, we measured 
the availability of the detour path via the other 34 Plan- 
etLab nodes. We consider for our analysis only the cases 
when at least 10% of our sources were simultaneously 
unable to reach the destination but at least 10% could. 

Figure 11 compares our ability to route around failures 
by intelligently choosing detours using iNano’s path pre- 
dictions as opposed to choosing detours at random. For 
the same number of detour paths, using iNano reduces 
the fraction of cases when the destination is unreachable 
by roughly a factor of 2. For example, the use of 5 detour 
paths leaves the destinations unreachable in 2% of cases 
compared to 4% of cases with the random strategy. 


8 Related Work


Our work benefits from a decade of work in Internet 
performance prediction [49, 17] and network measure- 
ment [51, 55]. Compared to most prior work, our goal 
is different: accurate prediction of sophisticated Internet 
performance metrics from lightweight end-hosts, which 
requires us to aggressively explore the trade-off between 
accuracy and representation size. 


8.1 Latency prediction 


IDMaps [17] pioneered the idea of a network information
service that provides latency information between arbitrary
end-hosts on the Internet. IDMaps issues pings
from a set of vantage points to all participating end-hosts
and also measures latencies between all pairs of
vantage points. As more vantage points are added, the
size of IDMaps’ measurement data grows proportional to
the square of the number of vantage points. Therefore, 
IDMaps uses a spanner-graph representation to compress 
its data. iNano tackles a different compression problem, 
that of compactly representing information encoded in 
the forwarding tables of all routers in the Internet. 

Ng et al. [35] showed that Internet nodes could be em- 
bedded in a Euclidean coordinate space. The strength 
of the approach is that it is 1) simple because it treats 
the underlying network as a blackbox, and 2) lightweight 
because only a few bytes of coordinates per node need 
be stored. A large body of work has since refined this 
basic approach to provide decentralization [36, 13], improved
computational efficiency [56], resilience to measurement
error [12, 13], security [12], and accuracy.
The techniques used to minimize error include Simplex 
minimization [12, 36], Principal Component Analysis 
(PCA) [27, 56], and spring relaxation [13]. 

The network coordinates approach poses two problems 
for our goals. First, the approach has been shown capable 
of predicting latencies, but it is unclear how to adapt the 
approach to other metrics that do not obey linear compo- 
sition, such as loss rate. Second, the approach is funda- 
mentally limited in accuracy. For example, about half of 
all Internet routes are known to be asymmetric [40] and 
a significant fraction are known to possess shorter detour 
routes [48]. However, common embedding techniques 
based on metric spaces will predict symmetric latencies 
and fail to predict detour routes when triangle inequality 
is violated. This limits the applicability of the approach 
for many applications. 
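
The two limitations can be made concrete with a small check over a latency matrix: any embedding into a metric space yields symmetric distances that obey the triangle inequality, so it cannot reproduce the cases flagged below. The function names and latencies are hypothetical.

```python
from itertools import permutations


def triangle_violations(latency):
    """Return (a, b, c) triples where the detour a -> b -> c beats a -> c.

    `latency` maps ordered (src, dst) pairs to one-way latency in ms and
    must contain every ordered pair of distinct nodes.
    """
    nodes = sorted({n for pair in latency for n in pair})
    return [(a, b, c) for a, b, c in permutations(nodes, 3)
            if latency[(a, b)] + latency[(b, c)] < latency[(a, c)]]


def asymmetric_pairs(latency, tolerance_ms=0.0):
    """Return unordered node pairs whose two directions differ in latency."""
    return sorted({tuple(sorted(pair)) for pair in latency
                   if abs(latency[pair] - latency[pair[::-1]]) > tolerance_ms})


# Hypothetical one-way latencies in ms, not data from the paper.
lat = {
    ("a", "b"): 30, ("b", "a"): 30,
    ("b", "c"): 30, ("c", "b"): 45,
    ("a", "c"): 100, ("c", "a"): 100,
}
violations = triangle_violations(lat)
asymmetric = asymmetric_pairs(lat)
```

Here the route from "a" to "c" via "b" is shorter than the direct route, and the "b"/"c" pair is asymmetric; a coordinate system would have to distort at least one of these latencies.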


8.2 Prediction of multiple metrics 


Sequoia [43] attempts to embed nodes onto a “virtual
prediction tree”. Edges of the tree are annotated with
latency and the latency between two nodes is predicted
as the length of the path connecting them. Unlike other 
coordinate systems, Sequoia is also extensible to band- 
width. However, it continues to use metric embeddings 
that predict symmetric routes with no detour routes. Aka- 
mai’s SureRoute [1] service optimizes transfers between 
end-hosts by routing through a mesh of detour nodes. 
End-hosts are routed through nearby detour nodes and 
the optimal path through the mesh of detour nodes is de- 
termined by constant monitoring. However, the perfor- 
mance along a path between two end-hosts is not nec- 
essarily the same as on the path via their nearby detour 
nodes. 


8.3 Structural inference 


iNano’s structural inference approach has been previ- 
ously used in iPlane. However, unlike iNano, iPlane 
adopts a centralized architecture that scales poorly to 1) 
Internet-scale query loads, and 2) more vantage points. 
iPlane uses an atlas of observed paths, whose size is pro- 
portional to the number of vantage points times the num- 
ber of destinations probed times the average path length. 
With iPlane’s current set of vantage points and destinations,
the size of its atlas is already over 1GB. As more
vantage points contribute measurements, iPlane’s accu- 
racy will increase, but at the cost of blowing up the size 
of its atlas. iPlane’s large atlas has the implication that 
its query engine can only be hosted on dedicated servers 
but not on typical end-hosts. iNano’s atlas instead com- 
prises link-level, not path-level, information of the Inter- 
net structure. Routing policies encoded in iPlane’s set of 
observed paths are replaced by iNano’s compact repre- 
sentation of the same. 


8.4 AS path inference 


iNano’s main focus is on predicting path performance between
arbitrary end-hosts, while also predicting the path between
them. Prior work has looked at a part of this problem:
inference of AS paths.

Mao et al. [32] describe a structural inference ap- 
proach, RouteScope, to infer AS-level paths. They use 
constrained optimization to model aspects of interdo- 
main policy routing such as customer<peer<provider 
and valley-free routing, and use additional measurement 
techniques to observe routes from multihomed prefixes. 
Our evaluation in Section 6 shows that iNano’s ability 
to predict AS paths is significantly better than that of 
RouteScope, with iNano predicting the AS path correctly 
for more than twice as many paths in our validation set. 

Qiu and Gao [42] build on RouteScope by using 
observed AS paths as constraints in predicting paths. 
Mühlbauer et al. [34] attempt to develop a hybrid model
of Internet routing that lies in between a blackbox and 
a structure inference approach. They introduce “quasi- 
routers” to model the presence of multiple border routers 
in an AS based on an observed set of routes. Their ap- 
proach can predict the training set exactly and achieves 
50% prediction accuracy for unobserved routes. Both 
these pieces of work require a set of AS paths to make 
predictions; an atlas of paths is not compact enough to 
serve iNano’s goal of distributing the atlas to end-hosts. 


9 Conclusions 


Our contribution is a practical one. Today, there is 
a gap between research techniques for Internet perfor- 
mance prediction, and the scalability and low-overhead 
desired by large-scale P2P applications. iPlane Nano is a 
lightweight Internet path performance prediction engine 
that applications can use today at low cost. To make this 
work, we develop a model of Internet routing that can 
predict PoP-level paths between arbitrary end-hosts with 
an atlas that is less than 7MB in size and can be updated 
with roughly 1MB/day. The compact nature of the at- 
las enables applications to have their clients download 
the atlas and process queries locally. Furthermore, be- 
cause the atlas is the same for all end-hosts, it can be 
disseminated to clients at low cost by using common P2P 
filesharing protocols, and thus largely using client band- 
widths. Our evaluation of iPlane Nano demonstrated the 
accuracy of its predictions and its utility in improving the 
performance of P2P applications. 
Abstract


This paper argues for a new approach to building Byzan- 
tine fault tolerant replication systems. We observe that 
although recently developed BFT state machine replica- 
tion protocols are quite fast, they don’t tolerate Byzantine 
faults very well: a single faulty client or server is capa- 
ble of rendering PBFT, Q/U, HQ, and Zyzzyva virtually 
unusable. In this paper, we (1) demonstrate that exist- 
ing protocols are dangerously fragile, (2) define a set of 
principles for constructing BFT services that remain use- 
ful even when Byzantine faults occur, and (3) apply these 
principles to construct a new protocol, Aardvark. Aard- 
vark can achieve peak performance within 40% of that of 
the best existing protocol in our tests and provide a sig- 
nificant fraction of that performance when up to f servers 
and any number of clients are faulty. We observe useful 
throughputs between 11706 and 38667 requests per sec- 
ond for a broad range of injected faults. 


1 Introduction 


This paper is motivated by a simple observation: al- 
though recently developed BFT state machine replica- 
tion protocols have driven the costs of BFT replication 
to remarkably low levels [1, 8, 12, 18], the reality is that 
they don’t tolerate Byzantine faults very well. In fact, a 
single faulty client or server can render these systems ef- 
fectively unusable by inflicting multiple orders of mag- 
nitude reductions in throughput and even long periods 
of complete unavailability. Performance degradations of 
such degree are at odds with what one would expect from 
a system that calls itself Byzantine fault tolerant—after 
all, if a single fault can render a system unavailable, can 
that system truly be said to tolerate failures? 

To illustrate the problem, Table 1 shows the measured
performance of a variety of systems both in the
absence of failures and when a single faulty client sub- 
mits a carefully crafted series of requests. As we show 
later, a wide range of other behaviors—faulty primaries, 
recovering replicas, etc.—can have a similar impact. We 


believe that these collapses are byproducts of a single- 
minded focus on designing BFT protocols with ever 
more impressive best-case performance. While this fo- 
cus is understandable—after years in which BFT repli- 
cation was dismissed as too expensive to be practical, 
it was important to demonstrate that high-performance 
BFT is not an oxymoron—it has led to protocols whose 
complexity undermines robustness in two ways: (1) the 
protocols’ design includes fragile optimizations that al- 
low a faulty client or server to knock the system off of 
the optimized execution path to an expensive alternative 
path and (2) the protocols’ implementation often fails to 
handle properly all of the intricate corner cases, so that 
the implementations are even more vulnerable than the 
protocols appear on paper. 

The primary contribution of this paper is to advocate a 
new approach, robust BFT (RBFT), to building BFT sys- 
tems. Our goal is to change the way BFT systems are de- 
signed and implemented by shifting the focus from con- 
structing high-strung systems that maximize best case 
performance to constructing systems that offer accept- 
able and predictable performance under the broadest pos- 
sible set of circumstances—including when faults occur. 


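System          Fault-free    Faulty client
PBFT [8]        61710         0
Q/U [1]†        23860         [illegible]
HQ [12]‡        [illegible]   [illegible]
Zyzzyva [18]    65099         [illegible]
Aardvark        38667         [illegible]
(peak throughput in requests/second)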

Table 1: Observed peak throughput of BFT systems in a
fault-free case and when a single faulty client submits a
carefully crafted series of requests. We detail our measurements
in Section 7.2. † The result reported for Q/U is
for correct clients issuing conflicting requests. ‡ The HQ
prototype demonstrates fault-free performance and does
not implement many of the error-handling steps required
to handle inconsistent MACs.
RBFT explicitly considers performance during both
gracious intervals—when the network is synchronous, 
replicas are timely and fault-free, and clients correct— 
and uncivil execution intervals in which network links 
and correct servers are timely, but up to $f = \lfloor (n-1)/3 \rfloor$
servers and any number of clients are faulty. The last 
row of Table 1 shows the performance of Aardvark, an 
RBFT state machine replication protocol whose design 
and implementation are guided by this new philosophy. 

In some ways, Aardvark is very similar to traditional 
BFT protocols: clients send requests to a primary who 
relays requests to the replicas who agree (explicitly or 
implicitly) on the sequence of requests and the corre- 
sponding results—not unlike PBFT [8], High through- 
put BFT [19], Q/U [1], HQ [12], Zyzzyva [18], ZZ [32], 
Scrooge [28], etc. 

In other ways, Aardvark is very different and chal- 
lenges conventional wisdom. Aardvark utilizes signa- 
tures for authentication, even though, as Castro correctly 
observes, “eliminating signatures and using MACs in- 
stead eliminates the main performance bottleneck in pre- 
vious systems” [7]. Aardvark performs regular view 
changes, even though view changes temporarily prevent 
the system from doing useful work. Aardvark utilizes 
point to point communication, even though renouncing 
IP-multicast gives up throughput deliberately.

We reach these counter-intuitive choices by following 
a simple and systematic approach: without ever compro- 
mising safety, we deliberately refocus both the design 
of the system and the engineering choices involved in 
its implementation on the stress that failures can impose 
on performance. In applying this strategy for RBFT to 
construct Aardvark, we choose an extreme position inspired
by maxi-min strategies in game theory [26]: we
reject any optimization for gracious executions that can 
decrease performance during uncivil executions. 

Surprisingly, these counter-intuitive choices impose 
only a modest cost on its peak performance. As Table 1 
illustrates, Aardvark sustains peak throughput of 38667 
requests/second, which is within 40% of the best perfor- 
mance we measure on the same hardware for four state-of-the-art
protocols. At the same time, Aardvark’s fault
tolerance is dramatically improved. For a broad range 
of client, primary, and server misbehaviors we prove that 
Aardvark’s performance remains within a constant fac- 
tor of its best case performance. Testing of the prototype 
shows that these changes significantly improve robust- 
ness under a range of injected faults. 

Once again, however, the main contribution of this pa- 
per is neither the Aardvark protocol nor implementation. 
It is instead a new approach that can—and we believe 
should—be applied to the design of other BFT protocols. 
In particular, we (1) demonstrate that existing protocols 
and their implementations are fragile, (2) argue that BFT
protocols should be designed and implemented with a focus
on robustness, and (3) use Aardvark to demonstrate
that the RBFT approach is viable: we gain qualitatively 
better performance during uncivil intervals at only mod- 
est cost to performance during gracious intervals. 

In Section 2 we describe our system model and the 
guarantees appropriate for high assurance systems. In 
Section 3 we elaborate on the need to rethink Byzan- 
tine fault tolerance and identify a set of design principles 
for RBFT systems. In Section 4 we present a system- 
atic methodology for designing RBFT systems and an 
overview of Aardvark. In Section 5 we describe in detail 
the important components of the Aardvark protocol. In 
Section 6 we present an analysis of Aardvark’s expected 
performance. In Section 7 we present our experimental 
evaluation. In Section 8 we discuss related work. 


2 System model 


We assume the Byzantine failure model where faulty 
nodes (servers or clients) can behave arbitrarily [21] and 
a strong adversary can coordinate faulty nodes to com- 
promise the replicated service. We do, however, assume 
the adversary cannot break cryptographic techniques like 
collision-resistant hashing, message authentication codes 
(MACs), encryption, and signatures. We denote a message
$X$ signed by principal $p$, and verifiable with $p$’s public
key, as $\langle X \rangle_{\sigma_p}$. We denote a message $X$ with a MAC
appropriate for principals $p$ and $r$ as $\langle X \rangle_{\mu_{r,p}}$. We denote
a message containing a MAC authenticator (an array of
MACs appropriate for verification by every replica) as
$\langle X \rangle_{\vec{\mu}_r}$.

Our model puts no restriction on clients, except that 
their number be finite: in particular, any number of 
clients can be arbitrarily faulty. However, the system’s 
safety and liveness properties are guaranteed only if at 
most $f = \lfloor \frac{|R|-1}{3} \rfloor$ servers are faulty, where $R$ is the set of servers.
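
A minimal sketch of this fault threshold, assuming the usual BFT bound of at most $\lfloor (n-1)/3 \rfloor$ faulty servers out of $n$, which matches the $2f+1$ and $f+1$ quorum sizes used in Section 5.2. The function names are illustrative and are not taken from the Aardvark implementation.

```python
def max_faulty(n_servers: int) -> int:
    """Largest f with n_servers >= 3f + 1 (assumed standard BFT bound)."""
    if n_servers < 1:
        raise ValueError("need at least one server")
    return (n_servers - 1) // 3


def quorum_sizes(n_servers: int) -> dict:
    """Quorum sizes that follow from f, as used in the agreement steps."""
    f = max_faulty(n_servers)
    return {"f": f, "commit_quorum": 2 * f + 1, "reply_quorum": f + 1}
```

With four servers this tolerates one faulty server; adding a fifth or sixth server does not raise the threshold, while a seventh does.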

Finally, we assume an asynchronous network where 
synchronous intervals, during which messages are deliv- 
ered with a bounded delay, occur infinitely often. 


Definition 1 (Synchronous interval). During a syn- 
chronous interval any message sent between correct pro- 
cesses is delivered within a bounded delay T if the sender 
retransmits according to some schedule until it is deliv- 
ered. 


3 Recasting the problem 


The foundation of modern BFT state machine replication 
rests on an impossibility result and on two principles that 
assist us in dealing with it. The impossibility result, of 
course, is FLP [13], which states that no solution to consensus
can be both safe and live in an asynchronous system
if nodes can fail. The two principles, first applied
by Lamport to his Paxos protocol [20], are at the core 
of Castro and Liskov’s seminal work on PBFT [7]. The 
first states that synchrony must not be needed for safety:
as long as a threshold of faulty servers is not exceeded,
the replicated service must always produce linearizable 
executions, independent of whether the network loses, 
reorders, or arbitrarily delays messages. The second rec- 
ognizes, given FLP, that synchrony must play a role in 
liveness: clients are guaranteed to receive replies to their 
requests only during intervals in which messages sent to 
correct nodes are received within some fixed (but poten- 
tially unknown) time interval from when they are sent. 
Within these boundaries, the engineering of BFT pro- 
tocols has embraced Lampson’s well-known recommen- 
dation: “Handle normal and worst case separately as a 
rule because the requirements for the two are quite dif- 
ferent. The normal case must be fast. The worst case 
must make some progress” [22]. Ever since PBFT, the 
design of BFT systems has then followed a predictable 
pattern: first, characterize what defines the normal (com- 
mon) case; then, pull out all the stops to make the system 
perform well for that case. While different systems don’t 
completely agree on what defines the common case [16], 
on one point they are unanimous: the common case in- 
cludes only gracious executions, defined as follows: 


Definition 2 (Gracious execution). An execution is gra- 
cious iff (a) the execution is synchronous with some 
implementation-dependent short bound on message de- 
lay and (b) all clients and servers behave correctly. 


The results of this approach continue to be spectac- 
ular. Since Zyzzyva last year reported a throughput of 
over 85,000 null requests per second [18], several new 
protocols have further improved on that mark [16, 28]. 

Despite these impressive results, we argue that a single-minded
focus on aggressively tuning BFT systems
for the best case of gracious execution, a practice that 
we have engaged in with relish [18], is increasingly mis- 
guided, dangerous, and even futile. 

It is misguided, because it encourages the design and 
implementation of systems that fail to deliver on their ba- 
sic promise: to tolerate Byzantine faults. While provid- 
ing impressive throughput during gracious executions, 
today’s high-performance BFT systems are content to
provide weak liveness guarantees (e.g., “eventual
progress”) in the presence of Byzantine failures. Unfortunately,
as we previewed in Figure 1 and show in detail
in Section 7.2, these guarantees are weak indeed. Although
current BFT systems can survive Byzantine faults
without compromising safety, we contend that a system 
that can be made completely unavailable by a simple 
Byzantine failure can hardly be said to tolerate Byzan- 
tine faults. 

It is dangerous, because it encourages fragile opti- 
mizations. Fragile optimizations are harmful in two 
ways. First, as we will see in Section 7.2, they make it 
easier for a faulty client or server to knock the system off
its hard-won optimized execution path and enter an alternative,
much more expensive one. Second, they weigh
down the system with subtle corner cases, increasing the 
likelihood of buggy or incomplete implementations. 

It is (increasingly) futile, because the race to optimize
common case performance has reached a point of diminishing
returns where many services’ peak demands are already
far under the best-case throughput offered by existing
BFT replication protocols. For such systems, good
enough is good enough, and further improvements in best 
case agreement throughput will have little effect on end- 
to-end system performance. 

In our view, a BFT system fulfills its obligations 
when it provides acceptable and dependable performance 
across the broadest possible set of executions, including 
executions with Byzantine clients and servers. In par- 
ticular, the temptation of fragile optimizations should be 
resisted: a BFT system should be designed around an 
execution path that has three properties: (1) it provides 
acceptable performance, (2) it is easy to implement, and 
(3) it is robust against Byzantine attempts to push the sys- 
tem away from it. Optimizations for the common case 
should be accepted only as long as they don’t endanger 
these properties. 

FLP tells us that we cannot guarantee liveness in an 
asynchronous environment. This is no excuse to cling to 
gracious executions only. In particular, there is no theo- 
retical reason why BFT systems should not be expected 
to perform well in what we call uncivil executions: 


Definition 3 (Uncivil execution). An execution is 
uncivil iff (a) the execution is synchronous with some 
implementation-dependent short bound on message de- 
lay, (b) up to f servers and an arbitrary number of clients 
are Byzantine, and (c) all remaining clients and servers 
are correct. 


Hence, we propose to build RBFT systems that pro- 
vide adequate performance during uncivil executions. 
Although we recognize that this approach is likely to re- 
duce the best case performance, we believe that for a 
BFT system a limited reduction in peak throughput is 
preferable to the devastating loss of availability that we 
report in Figure 1 and Section 7.2.

Increased robustness may come at effectively no ad- 
ditional cost as long as a service’s peak demand is be- 
low the throughput achievable through RBFT design: 
as a data point, our Aardvark prototype reaches a peak 
throughput of 38667 req/s. 

Similarly, when systems have other bottlenecks, Am- 
dahl’s law limits the impact of changing the performance 
of agreement. For example, we report in Section 7 that
PBFT can execute almost 62,000 null requests per second,
suggesting that agreement consumes 16.1 μs per request.
If, rather than a null service, we replicate a service
for which executing an average request consumes 100 μs
of processing time, then peak throughput with PBFT settles
to about 8613 requests per second. For the same service,
a protocol with twice the agreement overhead of
PBFT (i.e., 32.2 μs per request) would still achieve peak
throughput of about 7564 requests/second: in this hypothetical
example, doubling agreement overhead would
reduce peak end-to-end throughput by about 12%.
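
The arithmetic behind this example can be reproduced with a short sketch. It assumes, as the example does, that each request costs the agreement time plus the execution time, with both given in microseconds; the function and variable names are illustrative.

```python
def peak_throughput(agreement_us: float, execution_us: float) -> float:
    """Requests per second when each request costs agreement + execution time."""
    return 1e6 / (agreement_us + execution_us)


PBFT_AGREEMENT_US = 16.1   # per-request agreement cost quoted in the text
EXECUTION_US = 100.0       # per-request service cost in the hypothetical example

base = peak_throughput(PBFT_AGREEMENT_US, EXECUTION_US)
doubled = peak_throughput(2 * PBFT_AGREEMENT_US, EXECUTION_US)
reduction = 1 - doubled / base   # fraction of peak throughput lost

print(f"{base:.0f} req/s -> {doubled:.0f} req/s ({reduction:.1%} lower)")
```

The larger the execution cost is relative to agreement, the smaller the end-to-end penalty for a slower agreement protocol.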


4 Aardvark: RBFT in action 


Aardvark is a new BFT system designed and imple- 
mented to be robust to failures. The Aardvark pro- 
tocol consists of 3 stages: client request transmission, 
replica agreement, and primary view change. This is the
same basic structure as PBFT [8] and its direct descendants
[4, 18, 19, 33, 32], but revisited with the goal of
achieving an execution path that satisfies the properties 
outlined in the previous section: acceptable performance, 
ease of implementation, and robustness against Byzan- 
tine disruptions. To avoid the pitfalls of fragile opti- 
mizations, we focus at each stage of the protocol on how 
faulty nodes, by varying both the nature and the rate of 
their actions and omissions, can limit the ability of cor- 
rect nodes to perform in a timely fashion what the proto- 
col requires of them. This systematic methodology leads 
us to the three main design differences between Aardvark 
and previous BFT systems: (1) signed client requests, (2) 
resource isolation, and (3) regular view changes. 


Signed client requests. Aardvark clients use digital 
signatures to authenticate their requests. Digital signa- 
tures provide non-repudiation and ensure that all correct 
replicas make identical decisions about the validity of 
each client request, eliminating a number of expensive 
and tricky corner cases found in existing protocols that 
make use of weaker (though faster) message authentica- 
tion code (MAC) authenticators [7] to authenticate client 
requests. The difficulty with utilizing MAC authentica- 
tors is that they do not provide the non-repudiation prop- 
erty of digital signatures—one node validating a MAC 
authenticator does not guarantee that any other nodes 
will validate that same authenticator [2]. 

As we mentioned in the Introduction, digital signa- 
tures are generally seen as too expensive to use. Aard- 
vark uses them only for client requests where it is pos- 
sible to push the expensive act of generating the signa- 
ture onto the client while leaving the servers with the 
less expensive verification operation. Primary-to-replica, 
replica-to-replica, and replica-to-client communication 
rely on MAC authenticators. The quorum-driven nature 
of server-initiated communication ensures that a single 
faulty replica is unable to force the system into undesir- 
able execution paths. 

Because of the additional costs associated with verifying
signatures in place of MACs, Aardvark must guard
against new denial-of-service attacks where the system
receives a large number of requests with signatures that
need to be verified. Our implementation limits the number
of signature verifications a client can inflict on the
system by (1) utilizing a hybrid MAC-signature construct
to put a hard limit on the number of faulty signature verifications
a client can inflict on the system and (2) forcing
a client to complete one request before issuing the next.

Figure 1: Physical network in Aardvark.


Resource isolation. The Aardvark prototype imple- 
mentation explicitly isolates network and computational 
resources. 

As illustrated by Fig. 1, Aardvark uses separate net- 
work interface controllers (NICs) and wires to connect 
each pair of replicas. This step prevents a faulty server 
from interfering with the timely delivery of messages 
from good servers, as happened when a single broken 
NIC shut down the immigration system at the Los An- 
geles International Airport [9]. It also allows a node to 
defend itself against brute force denial of service attacks 
by disabling the offending NIC. However, using phys- 
ically separate NICs for communication between each 
pair of servers incurs a performance hit, as Aardvark can 
no longer use hardware multicast to optimize all-to-all 
communication. 

As Figure 2 shows, Aardvark uses separate work 
queues for processing messages from clients and indi- 
vidual replicas. Employing a separate queue for client 
requests prevents client traffic from drowning out the 
replica-to-replica communications required for the sys- 
tem to make progress. Similarly, employing a sepa- 
rate queue for each replica allows Aardvark to sched- 
ule message processing fairly, ensuring that a replica is 
able to efficiently gather the quorums it needs to make 
progress. Aardvark can also easily leverage separate 
hardware threads to process incoming client and replica 
requests. Taking advantage of hardware parallelism al- 
lows Aardvark to reclaim part of the costs paid to verify 
signatures on client requests.

Figure 2: Architecture of a single replica. The replica
utilizes a separate NIC for communicating with each
other replica and a final NIC to communicate with the
collection of clients. Messages from each NIC are placed
on separate worker queues.





We use simple brute force techniques for resource 
scheduling. One could consider network-level scheduling
techniques rather than distinct NICs in order to isolate
network traffic and/or allow rate-limited multicast.
Our goal is to make Aardvark as simple as possible, so 
we leave exploration of these techniques and optimiza- 
tions for future work. 


Regular view changes. To prevent a primary from 
achieving tenure and exerting absolute control on sys- 
tem throughput, Aardvark invokes the view change op- 
eration on a regular basis. Replicas monitor the perfor- 
mance of the current primary, slowly raising the level of 
minimal acceptable throughput. If the current primary 
fails to provide the required throughput, replicas initiate 
a view change. 
The key properties of this technique are: 

1. During uncivil intervals, system throughput remains 
high even when replicas are faulty. Since a primary 
maintains its position only if it achieves some increas- 
ing level of throughput, Aardvark bounds throughput 
degradation caused by a faulty primary by either forc- 
ing the primary to be fast or selecting a new primary. 


2. As in prior systems, eventual progress is guaranteed
when the system is eventually synchronous. 

Previous systems have treated view change as an op- 
tion of last resort that should only be used in desperate 
situations to avoid letting throughput drop to zero. How- 
ever, although the phrase “view change” carries conno- 
tations of a complex and expensive protocol, in reality 
the cost of a view change is similar to the regular cost 
of agreement. Performing view changes regularly intro- 
duces short periods of time during which new requests 
are not being processed, but the benefits of evicting a
misbehaving primary outweigh the periodic costs associated
with performing view changes.

Figure 3: Basic communication pattern in Aardvark.
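
The throughput check behind regular view changes can be sketched as follows. This is a minimal illustration, assuming a multiplicative raise of the required throughput after each observation period; the class name, the `growth` factor, and the starting requirement are hypothetical and are not the rule used by the Aardvark prototype.

```python
from dataclasses import dataclass


@dataclass
class PrimaryMonitor:
    """Tracks the throughput a primary must sustain to keep its role."""
    initial_required: float   # req/s demanded at the start of a view (assumed)
    growth: float = 1.01      # hypothetical raise applied after each period
    required: float = 0.0

    def __post_init__(self) -> None:
        self.required = self.initial_required

    def start_view(self) -> None:
        """Reset the requirement when a new primary takes over."""
        self.required = self.initial_required

    def observe(self, observed: float) -> bool:
        """Return True if the replica should initiate a view change."""
        if observed < self.required:
            return True
        # The primary met the bar, so slowly raise it for the next period.
        self.required *= self.growth
        return False
```

Because the bar keeps rising, even a primary that performs well is eventually replaced, which is what prevents a faulty primary from settling at a throughput just above a fixed threshold.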


5 Protocol description 


Figure 3 shows the agreement phase communication pat- 
tern that Aardvark shares with PBFT. Variants of this 
pattern are employed in other recent BFT RSM proto- 
cols [1, 12, 16, 18, 28, 32, 33], and we believe that, just 
as Aardvark illustrates how to adapt PBFT via RBFT 
system design, new Robust BFT systems based on these 
other protocols can and should be constructed. We orga- 
nize the following discussion around the numbered steps 
of the communication pattern of Figure 3. 


5.1 Client request transmission 


The fundamental challenge in transmitting client re- 
quests is ensuring that, upon receiving a client request, 
every replica comes to the same conclusion about the 
authenticity of the request. We ensure this property by 
having clients sign requests. 

To guard against denial of service, we break the pro- 
cessing of a client request into a sequence of increasingly 
expensive steps. Each step serves as a filter, so that more 
expensive steps are performed less often. For instance, 
we ask clients to also include a MAC on their signed
requests and have replicas verify only the signature of 
those requests whose MAC checks out. Additionally, 
Aardvark explicitly dedicates a single NIC to handling 
incoming client requests so that incoming client traffic 
does not interfere with replica-to-replica communication. 


5.1.1 Protocol Description 


The steps taken by an Aardvark replica to authenticate a 
client request follow. 


1. Client sends a request to a replica. 


A client $c$ requests an operation $o$ be performed by the
replicated state machine by sending a request message
$\langle \langle \text{REQUEST}, o, s, c \rangle_{\sigma_c}, c \rangle_{\mu_{c,p}}$ to the replica $p$ it believes
to be the primary. If the client does not receive a timely
response to that request, then the client retransmits the request
$\langle \langle \text{REQUEST}, o, s, c \rangle_{\sigma_c}, c \rangle_{\mu_{c,r}}$ to all replicas $r$. Note
that the request contains the client sequence number $s$
and is signed with signature $\sigma_c$. The signed message is
then authenticated with a MAC $\mu_{c,r}$ for the intended recipient.

Figure 4: Decision tree followed by replicas while verifying
a client request. The narrowing width of the edges
portrays the devastating losses suffered by the army of
client requests as it marches through the steppes of the
verification process. Apologies to Minard.


Upon receiving a client request, a replica proceeds to 
verify it by following a sequence of steps designed to 
limit the maximum load a client can place on a server, as 
illustrated by Figure 4: 


(a) Blacklist check. If the sender $c$ is not blacklisted, then
proceed to step (b). Otherwise discard the message.

(b) MAC check. If $\mu_{c,p}$ is valid, then proceed to step (c).
Otherwise discard the message.

(c) Sequence check. Examine the most recent cached reply
to $c$ with sequence number $s_{cache}$. If the request
sequence number $s_{req}$ is exactly $s_{cache} + 1$, then proceed
to step (d). Otherwise

(c1) Retransmission check. Each replica uses an exponential
back off to limit the rate of client reply
retransmissions. If a reply has not been sent to $c$ recently,
then retransmit the last reply sent to $c$. Otherwise
discard the message.

(d) Redundancy check. Examine the most recent cached
request from $c$. If no request from $c$ with sequence
number $s_{req}$ has previously been verified or the request
does not match the cached request, then proceed
to step (e). Otherwise (the request matches the cached
request from $c$) proceed to step (f).

(e) Signature check. If $\sigma_c$ is valid, then proceed to step
(f). Additionally, if the request does not match the
previously cached request for $s_{req}$, then blacklist $c$.
Otherwise if $\sigma_c$ is not valid, then blacklist the node $x$
that authenticated $\mu_{x,p}$ and discard the message.

(f) Once per view check. If an identical request has been
verified in a previous view, but not processed during
the current view, then act on the request. Otherwise
discard the message.

NSDI ’09: 6th USENIX Symposium on Networked Systems Design and Implementation


Primary and non-primary replicas act on requests in 
different ways. A primary adds requests to a PRE- 
PREPARE message that is part of the three-phase com- 
mit protocol described in Section 5.2. A non-primary 
replica r processes a request by authenticating the signed 
request with a MAC $\mu_{r,p}$ for the primary p and sending
the message to the primary. Note that non-primary repli- 
cas will forward each request at most once per view, but 
they may forward a request multiple times provided that 
a view change occurs between each occurrence. 
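
The verification steps (a) through (f) can be sketched as a single filter function. This is a simplified illustration under several assumptions: MAC and signature verification are replaced by boolean flags on the request, the sender is always the client itself, the back-off in step (c1) is omitted, and step (f) is read as "act at most once per view". All class and field names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set, Tuple


@dataclass(frozen=True)
class Request:
    client: str
    seq: int
    op: str
    mac_ok: bool = True   # stand-in for verifying the MAC
    sig_ok: bool = True   # stand-in for verifying the signature


@dataclass
class RequestFilter:
    view: int = 0
    blacklist: Set[str] = field(default_factory=set)
    s_cache: Dict[str, int] = field(default_factory=dict)               # last replied seq
    verified: Dict[str, Tuple[int, str]] = field(default_factory=dict)  # cached request
    acted: Dict[str, Tuple[int, int]] = field(default_factory=dict)     # (seq, view)
    sig_checks: int = 0   # counts the expensive operation

    def reply_sent(self, client: str, seq: int) -> None:
        """Record that the request with this sequence number was executed."""
        self.s_cache[client] = seq

    def handle(self, req: Request) -> str:
        c = req.client
        if c in self.blacklist:                        # (a) blacklist check
            return "discard"
        if not req.mac_ok:                             # (b) MAC check
            return "discard"
        if req.seq != self.s_cache.get(c, 0) + 1:      # (c) sequence check
            return "retransmit"                        # (c1), back-off omitted
        cached: Optional[Tuple[int, str]] = self.verified.get(c)
        if cached != (req.seq, req.op):                # (d) redundancy check
            self.sig_checks += 1                       # (e) signature check
            if not req.sig_ok:
                self.blacklist.add(c)                  # sender assumed to be c
                return "discard"
            if cached is not None and cached[0] == req.seq:
                self.blacklist.add(c)                  # conflicting signed requests
            self.verified[c] = (req.seq, req.op)
        if self.acted.get(c) == (req.seq, self.view):  # (f) once per view check
            return "discard"
        self.acted[c] = (req.seq, self.view)
        return "act"
```

Resubmitting an already verified request does not trigger a second signature check: it is either discarded in the same view or acted on once more after a view change.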

Note that a REQUEST message that is verified as au- 
thentic might contain an operation that the replicated ser- 
vice that runs above Aardvark rejects because of an ac- 
cess control list (ACL) or other service-specific security 
violation. From the point of view of Aardvark, such mes- 
sages are valid and will be executed by the service, per- 
haps resulting in an application level error code. 

A node $p$ only blacklists a sender $c$ of a
$\langle \langle \text{REQUEST}, o, s, c \rangle_{\sigma_c}, c \rangle_{\mu_{c,p}}$ message if the MAC $\mu_{c,p}$
is valid but the signature $\sigma_c$ is not. A valid MAC is sufficient
to ensure that routine message corruption is not
the cause of the invalid signature sent by c, but rather 
that c has suffered a significant fault or is engaging in 
malicious behavior. A replica discards all messages it re- 
ceives from a blacklisted sender and removes the sender 
from the blacklist after 10 minutes to allow reintegration 
of repaired machines. 
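
A blacklist with timed removal could look like the sketch below. The 10-minute lifetime comes from the text; the class name and the injectable `clock` parameter are assumptions made so the example can be exercised without waiting.

```python
import time


class Blacklist:
    """Set of senders whose messages are discarded until an entry expires."""

    def __init__(self, ttl_seconds: float = 600.0, clock=time.monotonic):
        self._ttl = ttl_seconds    # 10 minutes, as in the text
        self._clock = clock        # injectable clock (assumption, for testing)
        self._expires = {}         # sender -> time at which it is reinstated

    def add(self, sender: str) -> None:
        self._expires[sender] = self._clock() + self._ttl

    def __contains__(self, sender: str) -> bool:
        expiry = self._expires.get(sender)
        if expiry is None:
            return False
        if self._clock() >= expiry:
            # Entry has aged out: allow the repaired machine to rejoin.
            del self._expires[sender]
            return False
        return True
```

Expiry is checked lazily on lookup, so no background timer is needed.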


5.1.2 Resource scheduling 


Client requests are necessary to provide input to the RSM 
while replica-to-replica communication is necessary to 
process those requests. Aardvark leverages separate 
work queues for providing client requests and replica- 
to-replica communication to limit the fraction of replica 
resources that clients are able to consume, ensuring that a 
flood of client requests is unable to prevent replicas from 
making progress on requests already received. Of course, 
as in a non-BFT service, malicious clients can still deny
service to other clients by flooding the network between 
clients and replicas. Defending against these attacks is 
an area of active independent research [23, 30]. 

We deploy our prototype implementation on dual core 
machines. As Figure 2 shows, one core verifies client requests
and the second runs the replica protocol. This explicit
assignment allows us to isolate resources and take
advantage of parallelism to partially mask the additional 
costs of signature verification. 


5.1.3 Discussion


RBFT aims at minimizing the costs that faulty clients can 
impose on replicas. As Figure 4 shows, there are four ac- 
tions triggered by the transmission of a client request that 
can consume significant replica resources: MAC verifi- 
cation (MAC check), retransmission of a cached reply, 
signature verification (signature check), and request processing
(act on request). The cost a faulty client can
cause increases as the request passes each successive
check in the verification process, but the rate at which 
a faulty client can trigger this cost decreases at each step. 

Starting from the final step of the decision tree, the de- 
sign ensures that the most expensive message a client can 
send is a correct request as specified by the protocol, and 
it limits the rate at which a faulty client can trigger expen- 
sive signature checks and request processing to the max- 
imum rate a correct client would. The sequence check 
step (c) ensures that a client can trigger signature veri- 
fication or request processing for a new sequence num- 
ber only after its previous request has been successfully 
executed. The redundancy check (d) prevents repeated 
signature verifications for the same sequence number by 
caching each client’s most recent request. Finally, the 
once per view check (f) permits repeated processing of 
a request only across different views to ensure progress. 
The signature check (e) ensures that only requests that 
will be accepted by all correct replicas are processed. 
The net result of this filtering is that, for every $k$ correct
requests submitted by a client, each replica performs
at most $k + 1$ signature verifications, and any client that
imposes a $(k+1)$-st signature verification is blacklisted and
unable to instigate additional signature verifications until 
it is removed from the blacklist. 

Moving up the diagram, a replica responds to retrans- 
mission of completed requests paired with valid MACs 
by retransmitting the most recent reply sent to that client. 
The retransmission check (c1) imposes an exponential
back off on retransmissions, limiting the rate at which 
clients can force the replica to retransmit a response. To 
help a client learn the sequence number it should use, a 
replica resends the cached reply at this limited rate
both for requests that are from the past and for requests
that are too far into the future.
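
The rate limit in the retransmission check can be illustrated with a per-client exponential back-off. This is a sketch only: the base delay, the cap, and the doubling rule are assumptions, since the text does not give the parameters used in the prototype.

```python
class RetransmitLimiter:
    """Allows a cached reply to be resent to a client at a decaying rate."""

    def __init__(self, base_delay: float = 0.1, max_delay: float = 60.0):
        self.base_delay = base_delay   # hypothetical initial gap, in seconds
        self.max_delay = max_delay     # hypothetical cap on the gap
        self._state = {}               # client -> (next_allowed_time, next_gap)

    def allow(self, client: str, now: float) -> bool:
        """Return True if a retransmission to this client is permitted now."""
        next_allowed, gap = self._state.get(client, (0.0, self.base_delay))
        if now < next_allowed:
            return False   # a reply was sent recently: discard the request
        # Permit this retransmission and double the gap before the next one.
        self._state[client] = (now + gap, min(2 * gap, self.max_delay))
        return True

    def reset(self, client: str) -> None:
        """Forget the back-off, e.g. after the client makes real progress."""
        self._state.pop(client, None)
```

Each client has its own back-off state, so one client hammering a replica with stale requests does not delay replies to others.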

Any request that fails the MAC check (b) is immedi- 
ately discarded. MAC verifications occur on every in- 
coming message that claims to have the right format un- 
less the sender is blacklisted, in which case the blacklist 
check (a) results in the message being discarded. The 
rate of MAC verification operations is thus limited by the
rate at which messages purportedly from non-blacklisted
clients are pulled off the network, and the fraction of pro- 
cessing wasted is at most the fraction of incoming re- 
quests from faulty clients. 


5.2 Replica agreement 


Once a request has been transmitted from the client to 
the current primary, the replicas must agree on the re- 
quest’s position in the global order of operations. Aard- 
vark replicas coordinate with each other using a standard 
three phase commit protocol [8]. 

The fundamental challenge in the agreement phase is 
ensuring that each replica can quickly collect the quo- 
rums of PREPARE and COMMIT messages necessary to 
make progress. Conditioning expensive operations on 
the gathering of a quorum of messages makes it easier
to ensure robustness in two ways. First, it is possible
to design the protocol so that incorrect messages
sent by a faulty replica will never gain the support of a 
quorum of replicas. Second, as long as there exists a 
quorum of timely correct replicas, a faulty replica that 
sends correct messages too slowly, or not at all, cannot 
impede progress. Faulty replicas can introduce overhead 
also by sending messages too quickly: to protect them- 
selves, correct replicas in Aardvark schedule messages 
from other replicas in a round-robin fashion. 

Not all expensive operations in Aardvark are triggered 
by a quorum. In particular, a correct replica that has 
fallen behind its peers may ask them for the state it is 
missing by sending them a catchup message (see Sec- 
tion 5.2.1). Aardvark replicas defer processing such mes- 
sages to idle periods. Note that this state-transfer pro- 
cedure is self-tuning: if the system is unable to make 
progress because it cannot assemble quorums of PRE- 
PARE and COMMIT messages, then it will devote more 
time to processing catchup messages. 


5.2.1 Agreement protocol 


The agreement protocol requires replica-to-replica com- 
munication. A replica r filters, classifies, and finally acts 
on the messages it receives from another replica accord- 
ing to the decision tree shown in Figure 5: 


(a) Volume Check. If replica $q$ is sending too many messages,
blacklist $q$ and discard the message. Otherwise
continue to step (b). Aardvark replicas use a distinct
NIC for communicating with each replica. Using
per-replica NICs allows an Aardvark replica to silence
replicas that flood the network and impose excessive
interrupt processing load. In our prototype, we disable
a network connection when $q$’s rate of message transmission
in the current view is a factor of 20 higher than
for any other replica. After disconnecting $q$ for flooding,
$r$ reconnects $q$ after 10 minutes, or when $f$ other
replicas are disconnected for flooding.

(b) Round-Robin Scheduler. Among the pending messages,
select the next message to process from the
available messages in round-robin order based on the
sending replica. Discard received messages when the
buffers are full.

(c) MAC Check. If the selected message has a valid
MAC, then proceed to step (d); otherwise, discard the
message.

(d) Classify Message. Classify the authenticated message
according to its type:

• If the message is PRE-PREPARE, then process it immediately
in protocol step 3 below.

• If the message is PREPARE or COMMIT, then add it
to the appropriate quorum and proceed to step (e).

• If the message is a catchup message, then proceed
to step (f).

• If the message is anything else, then discard the
message.

(e) Quorum Check. If the quorum to which the message
was added is complete, then act as appropriate in protocol
steps 4-6 below.

(f) Idle Check. If the system has free cycles, then process
the catchup message. Otherwise, defer processing until
the system is idle.

Figure 5: Decision tree followed by a replica when handling
messages received from another replica. The width
of the edges indicates the rate at which messages reach
various stages in the processing.

Replica $r$ applies the above steps to each message it
receives from the network. Once messages are appropriately
filtered and classified, the agreement protocol continues
from step 2 of the communication pattern in Figure 3.
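
The round-robin scheduling step can be sketched with one bounded queue per sending replica. This is an illustrative sketch, not the prototype's data structure: the class name, the string message type, and the `capacity` default are assumptions.

```python
from collections import deque
from typing import Deque, Dict, List, Optional, Tuple


class RoundRobinScheduler:
    """Serves pending messages one sender at a time, in rotating order."""

    def __init__(self, replicas: List[str], capacity: int = 64):
        self._order = list(replicas)
        self._queues: Dict[str, Deque[str]] = {r: deque() for r in replicas}
        self._capacity = capacity   # hypothetical per-sender buffer size
        self._next = 0              # index of the sender to try first

    def enqueue(self, sender: str, message: str) -> bool:
        """Buffer a message; return False if it was dropped (buffer full)."""
        queue = self._queues[sender]
        if len(queue) >= self._capacity:
            return False
        queue.append(message)
        return True

    def next_message(self) -> Optional[Tuple[str, str]]:
        """Return (sender, message) from the next non-empty queue, if any."""
        n = len(self._order)
        for i in range(n):
            sender = self._order[(self._next + i) % n]
            queue = self._queues[sender]
            if queue:
                # Resume the rotation just after the sender served now.
                self._next = (self._next + i + 1) % n
                return sender, queue.popleft()
        return None
```

A replica that sends many messages only fills its own queue, so messages from slower senders are still served on their turn.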


2. Primary forms a PRE-PREPARE message
containing a set of valid requests and
sends the PRE-PREPARE to all replicas.

The primary creates and transmits a
$\langle \text{PRE-PREPARE}, v, n, \langle \text{REQUEST}, o, s, c \rangle_{\sigma_c} \rangle_{\vec{\mu}_p}$ message where $v$ is the
current view number, $n$ is the sequence number for
the PRE-PREPARE, and the authenticator is valid for all
replicas. Although we show a single request as part
of the PRE-PREPARE message, multiple requests can be
batched in a single PRE-PREPARE [8, 14, 18, 19].


3. Replica receives PRE-PREPARE from the 
primary, authenticates the PRE-PREPARE, 





and sends a PREPARE to all other replicas. 


Upon receipt of
$\langle \text{PRE-PREPARE}, v, n, \langle \text{REQUEST}, o, s, c \rangle_{\sigma_c} \rangle_{\vec{\mu}_p}$
from primary p, replica r verifies the message’s authenticity
following a process similar to the one described in Section 5.1
for verifying requests. If r has already accepted the
PRE-PREPARE message, r discards the message preemptively. If r
has already processed a different PRE-PREPARE message with
n’ = n during view v, then r discards the message. If r has not
yet processed a PRE-PREPARE message for n during view v, r first
checks that the appropriate portion of the MAC authenticator
$\vec{\mu}_p$ is valid. If the replica has not already done so,
it then checks the validity of $\sigma_c$. If the authenticator
is not valid, r discards the message. If the authenticator is
valid and the client signature is invalid, then the replica
blacklists the primary and requests a view change. If, on the
other hand, the authenticator and signature are both valid, then
the replica logs the PRE-PREPARE message and forms a
$\langle \text{PREPARE}, v, n, h, r \rangle_{\vec{\mu}_r}$ to be
sent to all other replicas, where h is the digest of the set of
requests contained in the PRE-PREPARE message.
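The order of checks in this step can be sketched as a single decision function. The sketch below is an illustration under assumptions: the `PrePrepareLog` type, the boolean `mac_ok` and `sig_ok` inputs (standing in for real cryptographic checks), and the returned strings are all hypothetical names.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple


@dataclass
class PrePrepareLog:
    """Hypothetical log: (view, sequence number) -> digest already accepted."""
    accepted: Dict[Tuple[int, int], str] = field(default_factory=dict)


def handle_pre_prepare(log: PrePrepareLog, view: int, seq: int, digest: str,
                       mac_ok: bool, sig_ok: bool) -> str:
    """Return the action a replica takes for one PRE-PREPARE message."""
    key = (view, seq)
    if key in log.accepted:
        # Same message again, or a different one for the same n in view v.
        return "discard"
    if not mac_ok:
        # The MAC authenticator is checked first because it is cheap.
        return "discard"
    if not sig_ok:
        # Valid MAC but invalid client signature implicates the primary.
        return "blacklist-primary-and-view-change"
    log.accepted[key] = digest
    return "log-and-send-prepare"
```

Checking the MAC before the signature means a replica never pays for a signature verification on a message whose authenticator is already known to be bad.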


4. Replica receives 2f PREPARE messages that are consistent
with the PRE-PREPARE message for sequence number n and
sends a COMMIT message to all other replicas.





Following receipt of 2f matching PREPARE messages from
non-primary replicas r’ that are consistent with a
PRE-PREPARE from primary p, replica r sends a
$\langle \text{COMMIT}, v, n, r \rangle_{\vec{\mu}_r}$ message
to all replicas. Note that the PRE-PREPARE message from the
primary is the (2f + 1)st message in the PREPARE quorum.

5. Replica receives 2f + 1 COMMIT messages, commits and
executes the request, and sends a REPLY message to the client.





After receipt of 2f + 1 matching
$\langle \text{COMMIT}, v, n, r' \rangle_{\vec{\mu}_{r'}}$
from distinct replicas r’, replica r commits and executes
the request before sending
$\langle \text{REPLY}, v, u, r \rangle_{\mu_{r,c}}$ to client c,
where u is the result of executing the request and v is the
current view.


6. The client receives f + 1 matching REPLY messages and
accepts the request as complete.





We also support Castro’s tentative execution optimiza- 
tion [8], but we omit these details here for simplicity. 
They do not introduce any new issues for our RBFT de- 
sign and analysis. 
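The quorum sizes used in steps 4 through 6 can be collected in one place. The sketch below is illustrative: the `Quorum` class and the `thresholds` helper are hypothetical names, and senders are represented by plain strings.

```python
from typing import Dict, Set


def thresholds(f: int) -> Dict[str, int]:
    """Message counts from steps 4-6 for a system tolerating f faults."""
    return {
        "prepare": 2 * f,      # plus the primary's PRE-PREPARE: 2f + 1 total
        "commit": 2 * f + 1,
        "reply": f + 1,
    }


class Quorum:
    """Counts matching messages from distinct senders."""

    def __init__(self, needed: int) -> None:
        self.needed = needed
        self.senders: Set[str] = set()

    def add(self, sender: str) -> bool:
        """Record a sender; return True once the quorum is complete."""
        self.senders.add(sender)  # a repeated sender is counted once
        return len(self.senders) >= self.needed
```

Tracking senders in a set is what enforces the "distinct replicas" requirement: a faulty replica that repeats its own message cannot complete a quorum by itself.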


Catchup messages. State catchup messages are not an
intrinsic part of the agreement protocol, but fulfill an
important logistical priority of bringing replicas that have
fallen behind back up to speed. If replica r receives a
catchup message from a replica q that has fallen behind,
then r sends q the state that q needs to catch up and resume
normal operations. Sending catchup messages is vital to
allow temporarily slow replicas to avoid becoming
permanently non-responsive, but it also offers faulty replicas
the chance to impose significant load on their non-faulty
counterparts. Aardvark explicitly delays the processing
of catchup messages until there are idle cycles available
at a replica: as long as the system is making progress,
processing a high volume of requests, there is no need to
spend time bringing a slow replica up to speed!


5.2.2 Discussion 


We now discuss the Aardvark agreement protocol 
through the lens of RBFT, starting from the bottom 
of Figure 5. Because every quorum contains at least 
a majority of correct replicas, faulty replicas can only 
marginally alter the rate at which correct replicas take 
actions (e) that require a quorum of messages. Fur- 
ther, because a correct replica processes catchup mes- 
sages (f) only when otherwise idle, faulty replicas can- 
not use catchup messages to interfere with the process- 
ing of other messages. When client requests are pend- 
ing, catchup messages are processed only if too many 
correct replicas have fallen behind and the processing 
of quorum messages needed for agreement has stalled— 
and only until enough correct replicas to enable progress 
have caught up. Also note that the queue of pending 
catchup messages is finite, and a replica discards excess
catchup messages. 
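A bounded catchup queue that is drained only when the replica is idle can be sketched as follows. The `CatchupQueue` class and its `capacity` parameter are assumptions for illustration; the text does not give the queue size used by the prototype.

```python
from collections import deque
from typing import Any, Deque, Optional


class CatchupQueue:
    """Finite queue of pending catchup messages (hypothetical sketch)."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.pending: Deque[Any] = deque()

    def offer(self, msg: Any) -> bool:
        """Queue a catchup message; excess messages are discarded."""
        if len(self.pending) >= self.capacity:
            return False
        self.pending.append(msg)
        return True

    def poll_if_idle(self, system_idle: bool) -> Optional[Any]:
        """Hand out the oldest catchup message only when the system is idle."""
        if system_idle and self.pending:
            return self.pending.popleft()
        return None
```

Because both the queue length and the processing opportunity are bounded, a faulty replica cannot use catchup traffic to grow memory use or steal cycles from agreement.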

A replica processes PRE-PREPARE messages at the
rate they are sent by the primary. If a faulty primary
sends them too slowly or too quickly, throughput may
be reduced, hastening the transition to a new primary as
described in Section 5.3.

Finally, a faulty replica could simply bombard its correct
peers with a high volume of messages that are eventually
discarded. The round-robin scheduler (b) limits the damage
that can result from this attack: if c of its peers have
pending messages, then a correct replica wastes at most 1/c
of the cycles spent checking MACs and classifying messages
on what it receives from any faulty replica. The round-robin
scheduler also discards messages that overflow a bounded
buffer, and the volume check (a) similarly limits the rate at
which a faulty replica can inject messages that the
round-robin scheduler will eventually discard.
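The 1/c bound follows directly from serving peers in turn from bounded per-peer buffers. The sketch below illustrates the idea; the `RoundRobinScheduler` class, its method names, and the `buffer_size` parameter are hypothetical and not taken from the prototype.

```python
from collections import deque
from typing import Any, Deque, Dict, List, Optional, Tuple


class RoundRobinScheduler:
    """Serve one message per peer in turn, with bounded per-peer buffers."""

    def __init__(self, peers: List[str], buffer_size: int) -> None:
        self.queues: Dict[str, Deque[Any]] = {p: deque() for p in peers}
        self.order: Deque[str] = deque(peers)
        self.buffer_size = buffer_size

    def enqueue(self, peer: str, msg: Any) -> bool:
        """Buffer a message; messages that overflow the buffer are dropped."""
        queue = self.queues[peer]
        if len(queue) >= self.buffer_size:
            return False
        queue.append(msg)
        return True

    def next_message(self) -> Optional[Tuple[str, Any]]:
        """Return the next (peer, message), skipping peers with nothing pending."""
        for _ in range(len(self.order)):
            peer = self.order[0]
            self.order.rotate(-1)  # move this peer to the back of the turn order
            if self.queues[peer]:
                return peer, self.queues[peer].popleft()
        return None
```

With three peers that all have pending messages, a flooding peer receives exactly one turn in three, no matter how many messages it tried to enqueue.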


5.3 Primary view changes


Employing a primary to order requests enables batch- 
ing [8, 14] and avoids the need to trust clients to obey 
a back off protocol [1, 10]. However, because the pri- 
mary is responsible for selecting which requests to exe- 
cute, the system throughput is at most the throughput of 
the primary. The primary is thus in a unique position to 
control both overall system progress [3, 4] and fairness 
to individual clients. 

The fundamental challenge to safeguarding perfor- 
mance against a faulty primary is that a wide range of pri- 
mary behaviors can hurt performance. For example, the 
primary can delay processing requests, discard requests, 
corrupt clients’ MAC authenticators, introduce gaps in 
the sequence number space, unfairly delay or drop some 
clients’ requests but not others, etc. 

Hence, rather than designing specific mechanisms to
defend against each of these threats, past BFT sys- 
tems [8, 18] have relied on view changes to replace an 
unsatisfactory primary with a new, hopefully better, one. 
Past systems trigger view changes conservatively, only 
changing views when it becomes apparent that the cur- 
rent primary is unlikely to allow the system to make even 
minimal progress. 

Aardvark uses the same view change mechanism de- 
scribed in PBFT [8]; in conjunction with the agreement 
protocol, view changes in PBFT are sufficient to ensure 
eventual progress. They are not, however, sufficient to 
ensure acceptable progress. 


5.3.1 Adaptive throughput 


Replicas monitor the throughput of the current primary.
If a replica judges the primary’s performance to be
insufficient, then the replica initiates a view change. More
specifically, replicas in Aardvark expect two things from
the primary: a regular supply of PRE-PREPARE messages
and high sustained throughput. Following the completion
of a view change, each replica starts a heartbeat timer
that is reset whenever the next valid PRE-PREPARE message
is received. If a replica does not receive the next valid
PRE-PREPARE message before the heartbeat timer expires,
the replica initiates a view change. To ensure eventual
progress, a correct replica doubles the heartbeat interval
each time the timer expires. Once the timer is reset because
a PRE-PREPARE message is received, the replica resets the
heartbeat timer back to its initial value. The value of the
heartbeat timer is application and environment specific: our
implementation uses a heartbeat of 40ms, so that a system
that tolerates f failures demands a minimum of 1 PRE-PREPARE
every $2^f \times 40$ms.
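The doubling and reset behaviour of the heartbeat timer can be sketched as a small state holder. The `HeartbeatTimer` class and its method names are hypothetical; only the 40ms starting value, the doubling on expiry, and the reset on a valid PRE-PREPARE come from the text.

```python
class HeartbeatTimer:
    """Tracks the heartbeat interval a replica grants the current primary."""

    def __init__(self, initial_ms: float = 40.0) -> None:
        self.initial_ms = initial_ms
        self.interval_ms = initial_ms

    def on_expire(self) -> str:
        """No valid PRE-PREPARE arrived in time: back off and change view."""
        self.interval_ms *= 2
        return "view-change"

    def on_valid_pre_prepare(self) -> None:
        """A valid PRE-PREPARE arrived: return to the initial interval."""
        self.interval_ms = self.initial_ms
```

After f consecutive expirations the interval is 2^f times the initial value, which is where the bound of one PRE-PREPARE every 2^f × 40ms comes from.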


The periodic checkpoints that, at pre-determined inter- 
vals, correct replicas must take to bound their state offer 
convenient synchronization points to assess the through- 
put that the primary is able to deliver. If the observed 
throughput in the interval between two successive check- 
points falls below a specified threshold, initially 90% of 
the maximum throughput observed during the previous 
n views, the replica initiates a view change to replace the 
current primary. At each checkpoint interval following 
an initial grace period at the beginning of each view, 5s in 
our prototype, the required throughput is increased by a 
factor of 0.01. Continually raising the bar that the current
primary must reach in order to stay in power guarantees
that the current primary will eventually be replaced,
restarting the process with the next primary. Conversely, if
the system workload changes, the required throughput adjusts
over n views to reflect the performance that a correct
primary can provide.


The combined effect of Aardvark’s new expectations 
on the primary is that during the first 5s of a view the 
primary is required to provide throughput of at least 1 re- 
quest per 40ms or face eviction. The throughput of any 
view that lasts longer than 5s is at least 90% of the max- 
imum throughput observed during the previous n views. 
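The checkpoint-based throughput test can be sketched as below. Two points are assumptions of this sketch rather than statements from the text: "increased by a factor of 0.01" is read as adding one percentage point to the required fraction at each checkpoint, and the `ThroughputMonitor` class with its integer `percent` field is a hypothetical construction.

```python
from typing import List


class ThroughputMonitor:
    """Decides at each checkpoint whether the primary keeps its role."""

    def __init__(self, recent_view_peaks: List[float],
                 start_percent: int = 90, step_percent: int = 1) -> None:
        # Maximum throughput observed during the previous n views.
        self.target = max(recent_view_peaks)
        self.percent = start_percent
        self.step_percent = step_percent

    def required(self) -> float:
        """Throughput the primary must currently deliver."""
        return self.target * self.percent / 100

    def at_checkpoint(self, observed: float, in_grace_period: bool) -> str:
        """Return 'view-change' if the primary fell short, else 'keep'."""
        if in_grace_period:
            return "keep"  # only the heartbeat timer applies during grace
        if observed < self.required():
            return "view-change"
        self.percent += self.step_percent  # raise the bar for next time
        return "keep"
```

Because the required fraction only rises within a view, even a correct primary is eventually replaced, which is what makes view changes periodic.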


5.3.2 Fairness 


In addition to hurting overall system throughput, primary 
replicas can influence which requests are processed. A 
faulty primary could be unfair to a specific client (or 
set of clients) by neglecting to order requests from that 
client. To limit the magnitude of this threat, replicas
track fairness of request ordering. When a replica receives
from a client c a request that it has not seen in a
PRE-PREPARE message, it adds the message to its request
queue and, before forwarding the request to the primary, it
records the sequence number k of the most recent PRE-PREPARE
received during the current view. The replica monitors
future PRE-PREPARE messages for that request, and if it
receives two PRE-PREPAREs for another client before
receiving a PREPARE for client c, then it declares the
current primary to be unfair and initiates a view change.
This ensures that two clients issuing comparable workloads
observe throughput values within a constant factor of each
other.
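The fairness rule can be sketched as a counter per waiting client. This sketch makes two simplifying assumptions that are not from the text: "two PRE-PREPAREs for another client" is read as the same other client being ordered twice, and the waiting request is considered served as soon as a PRE-PREPARE for its client is seen. The `FairnessMonitor` class is a hypothetical name.

```python
from typing import Dict


class FairnessMonitor:
    """Flags a primary that orders another client twice ahead of a waiter."""

    def __init__(self) -> None:
        # waiting client -> how often each other client was ordered since
        self.waiting: Dict[str, Dict[str, int]] = {}

    def on_client_request(self, client: str) -> None:
        """A request not yet seen in any PRE-PREPARE starts waiting."""
        self.waiting.setdefault(client, {})

    def on_pre_prepare(self, client: str) -> bool:
        """Record an ordered request; return True if the primary is unfair."""
        self.waiting.pop(client, None)  # this client is no longer waiting
        unfair = False
        for counts in self.waiting.values():
            counts[client] = counts.get(client, 0) + 1
            if counts[client] >= 2:
                unfair = True
        return unfair
```

A replica that sees `True` would declare the primary unfair and initiate a view change, as described above.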


5.3.3 Discussion


The adaptive view change and PRE-PREPARE heart- 
beats leave a faulty primary with two options: it can pro- 
vide substandard service and be replaced promptly, or it 
can remain the primary for an extended period of time 
and provide service comparable to what a non-faulty pri- 
mary would provide. A faulty primary that does not 
make any progress will be caught very quickly by the 
heartbeat timer and summarily replaced. To avoid being
replaced, a faulty primary must issue a steady stream of
PRE-PREPARE messages until it reaches a checkpoint
interval, when it is going to be replaced unless it has
provided the required throughput. To do just what is needed
to keep ahead of its reckoning for as long as possible,
a faulty primary will be forced to deliver 95% of the
throughput expected from a correct primary.

Periodic view changes may appear to institutionalize 
overhead, but their cost is actually relatively small. Al- 
though the term view change evokes images of substan- 
tial restructuring, in reality a view change costs roughly 
as much as a single instance of agreement with respect 
to message/protocol complexity: when performed every 
100+ requests, periodic view changes have marginal per- 
formance impact during gracious or uncivil intervals. 


6 Analysis 


In this section, we analyze the throughput characteristics 
of Aardvark when the number of client requests is large 
enough to saturate the system and a fraction g of those 
requests is correct. We show that Aardvark’s throughput 
during long enough uncivil executions is within a con- 
stant factor of its throughput during gracious executions 
of the same length provided there are sufficient correct 
clients to saturate the servers. 

For simplicity, we restrict our attention to an Aardvark
implementation on a single-core machine with a processor
speed of κ GHz. We consider only the computational
costs of the cryptographic operations (verifying signatures,
generating MACs, and verifying MACs), requiring
θ, α, and α cycles, respectively. Since these operations
occur only when a message is sent or received, and the
cost of sending or receiving messages is small, we expect
similar results when modeling network costs explicitly.

We begin by computing Aardvark’s peak throughput
during a gracious view, i.e. a view that occurs during a
gracious execution, in Theorem 1. We then show in
Theorem 2 that during uncivil views, i.e. views that occur
during uncivil executions, with a correct primary
Aardvark’s throughput is at least g times the throughput
achieved during a gracious view; as long as the primary
is correct, faulty replicas are unable to adversely
impact Aardvark’s throughput. Finally, we show that the
throughput of an uncivil execution is at least the fraction
of correct replicas times g times the throughput achieved
during a gracious view.

We begin in Theorem 1 by computing $t_{peak}$, Aardvark’s
peak throughput during a gracious view, i.e. a view
that occurs during a gracious execution. We then show
in Theorem 2 that during uncivil views in which the
primary replica is correct, Aardvark’s peak throughput is
only reduced to $g \times t_{peak}$. In other words, ignoring
low level network overheads, faulty replicas are unable to
curtail Aardvark’s throughput when the primary is correct.
Finally, we show in Theorem 3 that the throughput across
all views of an uncivil execution is within a constant
factor of $\frac{n-f}{n} \times g \times t_{peak}$.


Theorem 1. Consider a gracious view during which
the system is saturated, all requests come from correct
clients, and the primary generates batches of requests
of size b. Aardvark’s throughput is then at least
$\frac{\kappa}{\theta + \frac{4n + 2b - 4}{b}\alpha}$
operations per second.

Proof. We examine the actions required by each server
to process one batch of size b. For each request in the
batch, every server verifies one signature. The primary
also verifies one MAC per request. For each batch, the
primary generates n - 1 MACs to send the PRE-PREPARE
and verifies n - 1 MACs upon receipt of the PREPARE
messages; replicas instead verify one MAC in the
primary’s PRE-PREPARE, generate n - 1 MACs when
they send the PREPARE messages, and verify n - 2
MACs when they receive them. Finally, each server first
sends and then receives n - 1 COMMIT messages, for
which it generates and verifies a total of 2n - 2 MACs,
and generates a final MAC for each request in the batch
to authenticate the response to the client. The total
computational load per request is thus
$\theta + \frac{4n + 2b - 4}{b}\alpha$ at the primary, and
$\theta + \frac{4n + b - 4}{b}\alpha$ at a replica. The system’s
throughput at saturation during a sufficiently long view
in a gracious interval is thus at least
$\frac{\kappa}{\theta + \frac{4n + 2b - 4}{b}\alpha}$
requests/sec. □
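The per-batch MAC tally in the proof can be reproduced step by step. The sketch below rests on one assumption that should be read with care: for COMMIT messages each server is taken to generate n - 1 and verify n - 1 MACs, i.e. 2(n - 1) in total. The function names and the sample cycle costs in the usage lines are hypothetical and are not measurements from the paper.

```python
from typing import Tuple


def macs_per_batch(n: int, b: int) -> Tuple[int, int]:
    """MAC operations per batch of b requests at (primary, replica)."""
    commit = 2 * (n - 1)            # assumed: generate n-1, verify n-1
    primary = (b                    # verify one client MAC per request
               + (n - 1)            # generate MACs for the PRE-PREPARE
               + (n - 1)            # verify incoming PREPAREs
               + commit
               + b)                 # one reply MAC per request
    replica = (1                    # verify the primary's PRE-PREPARE
               + (n - 1)            # generate MACs for its PREPARE
               + (n - 2)            # verify the other replicas' PREPAREs
               + commit
               + b)                 # one reply MAC per request
    return primary, replica


def gracious_throughput(cycles_per_sec: float, theta: float, alpha: float,
                        n: int, b: int) -> float:
    """Requests per second when the busiest server is the bottleneck."""
    primary_macs, replica_macs = macs_per_batch(n, b)
    primary_load = theta + primary_macs / b * alpha   # cycles per request
    replica_load = theta + replica_macs / b * alpha
    return cycles_per_sec / max(primary_load, replica_load)


# Hypothetical costs: 3 GHz core, 100000-cycle signature check, 1000-cycle MAC.
example = gracious_throughput(3e9, theta=100000.0, alpha=1000.0, n=4, b=10)
```

With n = 4 and b = 10 the tally gives 32 MAC operations at the primary and 22 at a replica, matching 4n + 2b - 4 and 4n + b - 4, so under these assumptions the primary bounds the throughput.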


Theorem 2. Consider an uncivil view in which the pri- 
mary is correct and at most f replicas are Byzantine. 
Suppose the system is saturated, but only a fraction of 
the requests received by the primary are correct. The 
throughput of Aardvark in this uncivil view is within a 
constant factor of its throughput in a gracious view in 
which the primary uses the same batch size. 


Proof. Let θ and α denote the cost of verifying,
respectively, a signature and a MAC. We show that if g is the
fraction of correct requests, the throughput during
uncivil views with a correct primary approaches g of the
gracious view’s throughput as the ratio α/θ tends to 0.

In an uncivil view, faulty clients may send unfaithful
requests to every server. Before being able to form
a batch of b correct requests, the primary may have
to verify b/g signatures and MACs, and correct replicas
may verify b/g signatures and an additional (b/g)(1 - g)
MACs. Because a correct server processes messages
from other servers in round-robin order, it will process
at most two messages from a faulty server per
message that it would have processed had the server
been correct. The total computational load per request
is thus
$\frac{1}{g}\left(\theta + \frac{b(1+g) + 4g(n-1)}{b}\alpha\right)$
at the primary, and
$\frac{1}{g}\left(\theta + \frac{b + 4g(n-1)}{b}\alpha\right)$
at a replica. The system’s throughput at saturation during a
sufficiently long view in an uncivil interval with a correct
primary thus is at least
$\frac{g\kappa}{\theta + \frac{b(1+g) + 4g(n-1)}{b}\alpha}$
requests per second: as the ratio α/θ tends to 0, the ratio
between the uncivil and gracious throughput approaches g. □
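The limiting behaviour claimed in the proof can be checked numerically. The per-request loads below are an assumed reading of the proof: the primary verifies b/g signatures and b/g client MACs per batch of b correct requests, plus 4(n - 1) + b server and reply MACs. The function names are hypothetical.

```python
def gracious_load(theta: float, alpha: float, n: int, b: int) -> float:
    """Assumed cycles per request at the primary in a gracious view."""
    return theta + (4 * n + 2 * b - 4) / b * alpha


def uncivil_load(theta: float, alpha: float, n: int, b: int, g: float) -> float:
    """Assumed cycles per correct request at the primary in an uncivil view."""
    return (theta + (b * (1 + g) + 4 * g * (n - 1)) / b * alpha) / g


def uncivil_to_gracious_ratio(theta: float, alpha: float,
                              n: int, b: int, g: float) -> float:
    """Throughput ratio; throughput is inversely proportional to load."""
    return gracious_load(theta, alpha, n, b) / uncivil_load(theta, alpha, n, b, g)
```

Two sanity checks follow from these expressions: with g = 1 the uncivil load reduces to the gracious load, and as the MAC cost becomes negligible relative to the signature cost the ratio tends to g.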


Theorem 3. For sufficiently long uncivil executions and 
for small f the throughput of Aardvark, when properly 
configured, is within a constant factor of its throughput 
in a gracious execution in which primary replicas use the 
same batch size. 


Proof. First consider the case in which all the uncivil
views have correct primary replicas. Assume that in a
properly configured Aardvark $t_{baseViewTimeout}$ is set so
that during an uncivil interval, a view change to a
correct primary completes within $t_{baseViewTimeout}$. Since
a primary’s view lasts at least $t_{gracePeriod}$, as the
ratio α/θ tends to 0, the ratio between the throughput
during a gracious view and an uncivil interval approaches
$\frac{t_{gracePeriod}}{t_{baseViewTimeout} + t_{gracePeriod}}$.

Now consider the general case. If the uncivil interval
is long enough, at most f/n of its views will have a
Byzantine primary. Aardvark’s heartbeat timer provides two
guarantees. First, a Byzantine server that does not
produce the throughput that is expected of a correct server
will not last as primary for longer than a grace period.
Second, a correct server is always retained as a primary
for at least the length of a grace period. Furthermore,
since the throughput expected of a primary at the
beginning of a view is a constant fraction of the maximum
throughput achieved by the primary replicas of the last
n views, faulty primary replicas cannot arbitrarily lower
the throughput expected of a new primary. Finally, since
the view change timeout is reset after a view change
that results in at least one request being executed in the
new view, no view change attempt takes longer than
$t_{maxViewTimeout} = 2^f t_{baseViewTimeout}$. It follows that,
during a sufficiently long uncivil interval, the throughput
will be within a factor of
$\frac{t_{gracePeriod}}{t_{maxViewTimeout} + t_{gracePeriod}} \cdot \frac{n-f}{n}$
of that of Theorem 2, and, as α/θ tends to 0, the ratio between

Figure 6: Latency vs. throughput for various BFT systems.

7 Evaluation


We evaluate the performance of Aardvark, PBFT, HQ,
Q/U and Zyzzyva on an Emulab cluster [31]. This cluster
consists of machines with dual 3GHz Intel Pentium 4
Xeon processors, 1GB of memory, and 1 Gb/s Ethernet
connections.

The code bases used to report our results are provided
by the respective systems’ authors. James Cowling
provided us the December 2007 public release of the PBFT
code base [5] as well as a copy of the HQ code base.
We used version 1.3 of the Q/U code base, provided to
us by Michael Abd-El-Malek in October 2008 [27]. The
Zyzzyva code base is the version used in the SOSP 2007
paper [18]. Whenever feasible, we rely on the existing
pre-configurations for each system to handle f = 1
Byzantine failure.

Our evaluation makes three points: (a) despite our 
choice to utilize signatures, change views regularly, and 
forsake IP multicast, Aardvark’s peak throughput is com- 
petitive with that of existing systems; (b) existing sys- 
tems are vulnerable to significant disruption as a result 
of a broad range of Byzantine behaviors; and (c) Aard- 
vark is robust to a wide range of Byzantine behaviors. 
When evaluating existing systems, we attempt to iden- 
tify places where the prototype implementation departs 
from the published protocol. 


7.1 Aardvark 


Aardvark’s peak throughput is competitive with that of
state of the art systems as shown in Figure 6. Aardvark’s
throughput peaks at 38667 operations per second,
while Zyzzyva and PBFT observe maximum throughputs
of 65999 and 61710 operations per second, respectively.

Figures 7 and 8 explore the impact of regular view
changes on the latency observed by Aardvark clients in
an experiment with 210 clients each issuing 100,000
requests. Figure 7 shows the per request latency observed
by a single client during the run. The periodic latency
spikes correspond to view changes. When a client
issues a request as the view change is initiated, the request
is not processed until the request arrives at the new
primary following a client timeout and retransmission. In
most cases a single client retransmission is sufficient, but
additional retransmissions may be required when multiple
view changes occur in rapid succession. Figure 8
shows the CDF for latencies of all client requests in the
same experiment. We see that 99.99% of the requests
have latency under 15ms, and only a small fraction of
all requests incur the higher latencies induced by view
changes. We configure an Aardvark client with a
retransmission timeout of 150ms and we have not explored
other settings.

Figure 7: The latency of an individual client’s requests
running Aardvark with 210 total clients. The sporadic
jumps represent view changes in the protocol.

Figure 8: CDF of request latencies for 210 clients issuing

System                               Peak Throughput
Aardvark w/o signatures              97405
Aardvark w/o regular view changes    39771

Table 2: Peak throughput of Aardvark and incremental
versions of the Aardvark protocol.


7.1.1 Putting Aardvark together 


Aardvark incorporates several key design decisions that 
enable it to perform well in the presence of Byzantine 
failure. We study the performance impact of these de- 
cisions by measuring the throughput of several PBFT 
and Aardvark variations, corresponding to the evolution 
between these two systems. Table 2 reports these peak 
throughputs. 

While requiring clients in PBFT to sign requests re- 
duces throughput by 50%, we find that the cost of requir- 
ing Aardvark clients to use the hybrid MAC-signature 
scheme imposes a smaller 33% hit to system through- 
put. Explicitly separating the work queues for client 
and replica communication makes it easy for Aardvark 
to utilize the second processor in our test bed machines, 
which reduces the additional costs Aardvark pays to ver- 
ify signed client requests. This parallelism is the pri- 
mary source of the 30% improvement we observe be- 
tween PBFT with signatures and Aardvark. 

Peak throughput for Aardvark with and without reg- 
ular view changes is comparable. The reason for this 
is rather straightforward: when both the new and old 
primary replicas are non-faulty, a view change requires 
approximately the same amount of work as a single in- 
stance of consensus. Aardvark views led by a non-faulty 
primary are sufficiently long that the throughput costs as- 
sociated with performing a view change are negligible. 


7.2 Evaluating faulty systems 


In this section we evaluate Aardvark and existing sys- 
tems in the context of failures. It is impossible to test 
every possible Byzantine behavior; consequently we use 
our knowledge of the systems to construct a set of work- 
loads that we believe to be close to the worst case for 
Aardvark and other systems. While other faulty behav- 
iors are possible and may stress the evaluated systems in 
different ways, we believe that our results are indicative 
of both the frailty of existing systems and the robustness 
of Aardvark. 


7.2.1 Faulty clients 


We focus our attention on two aspects of client behavior 
that have significant impact on system throughput: re- 
quest dissemination and network flooding. 


Request dissemination. Table 1 in the Introduction
explores the impact of faulty client behavior related to re- 
quest distribution on the PBFT, HQ, Zyzzyva, and Aard- 
vark prototypes. We implement different client behaviors 
for the different systems in order to stress test the design 
decisions the systems have made. 

In PBFT and Zyzzyva, the clients send requests that
are authenticated with MAC authenticators. The faulty 
client includes an inconsistent authenticator on requests 
so that request verification will succeed at the primary 
but fail for all other replicas. When the primary includes 
the client request in a PRE-PREPARE message, the repli- 
cas are unable to verify the request. 

We developed this workload because, on paper, the 
protocols specify what appears to be an expensive pro- 
cessing path to handle this contingency. In this situa- 
tion PBFT specifies a view change while Zyzzyva in- 
vokes a conflict resolution procedure that blocks progress 
and requires replicas to generate signatures. In theory 
these procedures should have a noticeable, though finite, 
impact on performance. In particular, PBFT progress 
should stall until a timeout forces a new view ([6] pp. 42— 
43), at which point other clients can make some progress 
until the faulty client stalls progress again. In Zyzzyva, 
the servers should pay extra overheads for signatures and 
view changes. 

In practice the throughput of both prototype imple- 
mentations drops to 0. In Zyzzyva the reconciliation pro- 
tocol is not fully implemented; in PBFT the client be- 
havior results in repeated view changes, and we have not 
observed our experiment to finish. While the full PBFT 
and Zyzzyva protocol specifications guarantee liveness 
under eventual synchrony, the protocol steps required to 
handle these cases are sufficiently complex to be difficult 
to implement, easy to overlook, or both. 

In HQ, our intended attack is to have clients send cer- 
tificates during the WRITE-2 phase of the protocol with 
an inconsistent MAC authenticator. The response specified
by the protocol is a signed WRITE-2-REFUSED message,
which the client subsequently uses to initiate a request
processed by an internal PBFT protocol. This set of
circumstances presents a point in
the HQ design where a single client, either faulty or sim- 
ply unlucky, can force the replicas to generate expensive 
signatures resulting in a degradation in system through- 
put. We are unable to evaluate the precise impact of this 
client behavior because the replica processing necessary 
to handle inconsistent MAC authenticators from clients 
is not implemented. 

Q/U clients, in the absence of contention, are unable to
influence each other’s operations. During contention,
replicas are required to perform barrier and commit
operations that are rate limited by a client-initiated
exponential back off. During the barrier and commit
operations, a faulty client that sends inconsistent certificates
to the replicas can theoretically complicate the process
further. We implement a simpler scenario in which all
clients are correct, yet they issue contending requests to
the replicas. In this setting with only 20 clients, Q/U
provides 0 throughput. Q/U’s focus on performance in the
absence of both failures and contention makes it
especially vulnerable in practice: clients that issue
contending requests can decimate system throughput, whether
the clients are faulty or not.

To avoid corner cases where different replicas make 
different judgments about the legitimacy of a request, 
Aardvark clients sign requests. In Aardvark, the closest 
analogous client behaviors to those discussed above for 
other systems are sending requests with a valid MAC and 
invalid signature or sending requests with invalid MACs. 
We implement both attacks and find the results to be 
comparable. In Table 1 we report the results for requests 
with invalid MACs. 


Network flooding. In Table 3 we demonstrate the im- 
pact of a single faulty client that floods the replicas with 
messages. During these experiments correct clients issue 
requests sufficient to saturate each system while a single 
faulty client implements a brute force denial of service 
attack by repeatedly sending 9KB UDP messages to the 
replicas. For PBFT and Zyzzyva, 210 clients are suffi- 
cient to saturate the servers while Q/U and HQ are satu- 
rated with 30 client processes. 

The PBFT and Zyzzyva prototypes suffer dramatic
performance degradation as their incoming network
resources are consumed by the flooding client; processing
the incoming client requests disrupts the
replica-to-replica communication necessary for the systems to
make progress. In both cases, the pending client
requests eventually overflow internal queues and crash
the servers. Q/U and HQ suffer smaller degradations in
throughput from the spamming client. The UDP packets
are dropped by the network stack with minimal processing
because they are not valid TCP packets. The slowdowns
observed in Q/U and HQ correspond to the displaced
network bandwidth.

The reliance on TCP communication in Q/U and HQ 
changes rather than solves the challenge presented by a 
flooding client. For example, a single faulty client that 
repeatedly requests TCP connections crashes both the 
Q/U and HQ servers. 

In each of these systems, the vulnerability to network 
flooding is a byproduct of the prototype implementation 
and is not fundamental to the protocol design. Network 
isolation techniques such as those described in Section 5 
could similarly be applied to these systems. 

In the case of Aardvark, the decision to use separate
NICs and work queues for client and replica requests
ensures that a faulty client is unable to prevent replicas
from processing requests that have already entered the
system. The throughput degradation observed by Aardvark
tracks the fraction of requests that replicas receive
that were sent by non-faulty clients.

System      Peak Throughput   UDP flooding   TCP flooding
Zyzzyva     65999             crash          -
Aardvark    38667             7873

Table 3: Observed peak throughput of BFT systems in
the fault free case and under heavy client retransmission
load. UDP network flooding corresponds to a single
faulty client sending 9KB messages. TCP network flooding
corresponds to a single faulty client sending requests
to open TCP connections and is shown for TCP based
systems.

Table 4: Throughput during intervals in which the
primary delays sending PRE-PREPARE messages (or
equivalent) by 1, 10, and 100 ms.


7.2.2 Faulty Primary 


In systems that rely on a primary, the primary controls 
the sequence of requests that are processed during the 
current view. 

In Table 4 we show the impact on PBFT, Zyzzyva, 
and Aardvark prototypes of a primary that delays send- 
ing PRE-PREPARE messages by 1, 10, or 100 ms. The 
throughput of both PBFT and Zyzzyva degrades dramat- 
ically as the slow primary is not slow enough to trigger 
their view change conditions. This throughput degrada- 
tion is a consequence of the protocol design and spec- 
ification of when view changes should occur. With an 
extremely slow primary, Zyzzyva eventually succumbs 
to a memory leak exacerbated by holding on to requests 
for an extended period of time. The throughput achieved 
by Aardvark indicates that adaptively performing view 
changes in response to observed throughput is a good 
technique for ensuring performance. 

In addition to controlling the rate at which requests
are inserted into the system, the primary is also
responsible for controlling which requests are inserted into the
system. Table 5 explores the impact that an unfair
primary can have on the throughput of a targeted node. In
the case of PBFT and Aardvark, the primary sends a
PRE-PREPARE for the targeted client’s request only after
receiving the request 9 times. This heuristic prevents
the PBFT primary from triggering a view change
and demonstrates dramatic degradation in throughput for
the targeted client in comparison to the other clients in
the system. For Zyzzyva, the unfair primary ignores
messages from the targeted client entirely. The resulting
throughput is 0 because the implementation is
incomplete, and replicas in the Zyzzyva prototype do not
forward received requests to the primary as specified by
the protocol. Aardvark’s fairness detection and periodic
view changes limit the impact of the unfair primary.

Starved Throughput | Normal Throughput
PBFT (446
(718

Table 5: Average throughput for a starved client that is
shunned by a faulty primary versus the average per-client
throughput for any other client.


7.2.3 Non-Primary Replicas 


We implement a faulty replica that fails to process pro- 
tocol messages and instead blasts network traffic at the 
other replicas and show the results in Table 6. In the 
first experiments, a faulty replica blasts 9KB UDP mes- 
sages at the other replicas. The PBFT and Zyzzyva pro- 
totypes again show very low performance as the incom- 
ing traffic from the spamming replica displaces much of 
the legitimate traffic in the system, denying the system 
both requests from the clients and also replica messages 
required to make progress. Aardvark’s use of separate 
worker queues ensures that the replicas process the mes- 
sages necessary to make progress. In the second exper- 
iment, the faulty replica repeatedly attempts to open 
TCP connections, consuming all of the incoming con- 
nections on the other replicas and denying the clients ac- 
cess to the service. 

Once again, the shortcomings of the systems are a 
byproduct of implementation and not protocol design. 
We speculate that improved network isolation techniques 
would make the systems more robust. 


8 Related work 


We are not the first to notice significantly reduced per- 
formance for BFT protocols during periods of failures or 
bad network performance or to explore how timing and 
failure assumptions impact performance and liveness of 
fault tolerant systems. 

Singh et al. [29] show that PBFT [8], Q/U [1], 
HQ [12], and Zyzzyva [18] are all sensitive to network 
performance. They provide a thorough examination of 


System | Peak Throughput | Replica Flooding (UDP) | Replica Flooding (TCP) 
PBFT | [illegible] | 251 | - 
Zyzzyva | 65099 | 0 | [illegible] 
Aardvark | 38667 | 1706 | - 


Table 6: Observed peak throughput and observed 
throughput when one replica floods the network with 
messages. UDP flooding consists of a replica sending 
9KB messages to other replicas rather than following the 
protocol. TCP flooding consists of a replica repeatedly 
attempting to open TCP connections on other replicas. 





the gracious executions of the four canonical systems 
through a ns2 [25] network simulator. Singh et al. ex- 
plore performance properties when the participants are 
well behaved and the network is faulty; we focus our at- 
tention on the dual scenario where the participants are 
faulty and the network is well behaved. 

Aiyer et al. [3] and Amir et al. [4] note that a slow 
primary can result in dramatically reduced throughput. 
Aiyer et al. combat this problem by frequently rotating 
the primary. Amir et al. address the challenge instead by 
introducing a pre-agreement protocol requiring several 
all-to-all message exchanges and utilizing signatures for 
all authentication. Their solution is designed for envi- 
ronments where throughput of 800 requests per second 
is considered good. Condie et al. [11] address the ability 
of a well placed adversary to disrupt the performance of 
an overlay network by frequently restructuring the over- 
lay, effectively changing its view. 

The signature processing and scheduling of replica 
messages in Aardvark is similar in flavor to the early 
rejection techniques employed by the LOCKSS sys- 
tem [15, 24] in order to improve performance and limit 
the damage an adversary can inflict on the system. 

PBFT [8], Q/U [1], HQ [12], and Zyzzyva [18] are re- 
cent BFT replication protocols that focus on optimizing 
performance during gracious executions and collectively 
demonstrate that BFT replication systems can provide 
excellent performance during gracious executions. We 
instead focus on increasing the robustness of BFT sys- 
tems by providing good performance during uncivil exe- 
cutions. Hendricks et al. [17] explore the use of erasure 
coding to increase the efficiency of BFT replicated storage; 
they emphasize increasing the bandwidth and storage 
efficiency of a replication protocol similar to Q/U and 
not the fault tolerance of the replication protocol. 


9 Conclusion 


We claim that high assurance systems require BFT pro- 
tocols that are more robust to failures than existing sys- 
tems. Specifically, BFT protocols suitable for high as- 
surance systems must provide adequate throughput dur- 
ing uncivil intervals in which the network is well behaved 
but an unknown number of clients and up to f servers are 
faulty. We present Aardvark, the first BFT state machine 
protocol designed and implemented to provide good per- 
formance in the presence of Byzantine faults. Aardvark 
gives up some throughput during gracious executions, for 
significant improvement in performance during uncivil 
executions. 

Aardvark is far from being the last word in robust 
BFT replication: we believe that improvements to the 
design and implementation of Aardvark, as well as to 
the methodology that led us to it, are both possible and 
likely. Specific challenges that remain for future work 
include formally verifying the design and implementa- 
tions of BFT systems, developing a notion of optimal- 
ity for robust BFT systems that captures the fundamen- 
tal tradeoffs between fault-free and fault-full performance, 
and extending BFT replication to deployable large scale 
applications. 


Abstract 


Many distributed services are hosted at large, shared, geograph- 
ically diverse data centers, and they use replication to achieve 
high availability despite the unreachability of an entire data 
center. Recent events show that non-crash faults occur in these 
services and may lead to long outages. While Byzantine-Fault 
Tolerance (BFT) could be used to withstand these faults, cur- 
rent BFT protocols can become unavailable if a small frac- 
tion of their replicas are unreachable. This is because exist- 
ing BFT protocols favor strong safety guarantees (consistency) 
over liveness (availability). 

This paper presents a novel BFT state machine replication 
protocol called Zeno that trades consistency for higher avail- 
ability. In particular, Zeno replaces strong consistency (lin- 
earizability) with a weaker guarantee (eventual consistency): 
clients can temporarily miss each other’s updates but when the 
network is stable the states from the individual partitions are 
merged by having the replicas agree on a total order for all re- 
quests. We have built a prototype of Zeno and our evaluation 
using micro-benchmarks shows that Zeno provides better avail- 
ability than traditional BFT protocols. 


1 Introduction 


Data centers are becoming a crucial computing platform 
for large-scale Internet services and applications in a va- 
riety of fields. These applications are often designed as 
a composition of multiple services. For instance, Ama- 
zon’s S3 storage service and its e-commerce platform use 
Dynamo [15] as a storage substrate, or Google’s indices 
are built using the MapReduce [14] parallel processing 
framework, which in turn can use GFS [18] for storage. 

Ensuring correct and continuous operation of these 
services is critical, since downtime can lead to loss of 
revenue, bad press, and customer anger [5]. Thus, to 
achieve high availability, these services replicate data 
and computation, commonly at multiple sites, to be able 
to withstand events that make an entire data center un- 
reachable [15] such as network partitions, maintenance 
events, and physical disasters. 

When designing replication protocols, assumptions 
have to be made about the types of faults the protocol 
is designed to tolerate. The main choice lies between a 
crash-fault model, where it is assumed nodes fail cleanly 
by becoming completely inoperable, or a Byzantine-fault 
model, where no assumptions are made about faulty 


components, capturing scenarios such as bugs that cause 
incorrect behavior or even malicious attacks. A crash- 
fault model is typically assumed in most widely deployed 
services today, including those described above; the pri- 
mary motivation for this design choice is that all ma- 
chines of such commercial services run in the trusted en- 
vironment of the service provider’s data center [15]. 

Unfortunately, the crash-fault assumption is not al- 
ways valid even in trusted environments, and the con- 
sequences can be disastrous. To give a few recent exam- 
ples, Amazon’s S3 storage service suffered a multi-hour 
outage, caused by corruption in the internal state of a 
server that spread throughout the entire system [2]; also 
an outage in Google’s App Engine was triggered by a bug 
in datastore servers that caused some requests to return 
errors [19]; and a multi-day outage at the Netflix DVD 
mail-rental was caused by a faulty hardware component 
that triggered a database corruption event [28]. 

Byzantine-fault-tolerant (BFT) replication protocols 
are an attractive solution for dealing with such faults. Re- 
cent research advances in this area have shown that BFT 
protocols can perform well in terms of throughput and la- 
tency [23], they can use a small number of replicas equal 
to their crash-fault counterparts [9,37], and they can be 
used to replicate off-the-shelf, non-deterministic, or even 
distinct implementations of common services [29, 36]. 

However, most proposals for BFT protocols have fo- 
cused on strong semantics such as linearizability [22], 
where intuitively the replicated system appears to the 
clients as a single, correct, sequential server. The price to 
pay for such strong semantics is that each operation must 
contact a large subset (more than $\frac{2}{3}$ or in some cases $\frac{4}{5}$) 
of the replicas to conclude, which can cause the system to 
halt if more than a small fraction ($\frac{1}{3}$ or $\frac{1}{5}$, respectively) of 
the replicas are unreachable due to maintenance events, 
network partitions, or other non-Byzantine faults. This 
contrasts with the philosophy of systems deployed in cor- 
porate data centers [15,21,34], which favor availability 
and performance, possibly sacrificing the semantics of 
the system, so they can provide continuous service and 
meet tight SLAs [15]. 

In this paper we propose Zeno, a new BFT replication 
protocol designed to meet the needs of modern services 
running in corporate data centers. In particular, Zeno fa- 
vors service performance and availability, at the cost of 
providing weaker consistency guarantees than traditional 
BFT replication when network partitions and other infre- 
quent events reduce the availability of individual servers. 

Zeno offers eventual consistency semantics [17], 
which intuitively means that different clients can be un- 
aware of the effects of each other’s operations, e.g., dur- 
ing a network partition, but operations are never lost 
and will eventually appear in a linear history of the 
service—corresponding to that abstraction of a single, 
correct, sequential server—once enough connectivity is 
re-established. 

In building Zeno we did not start from scratch, but in- 
stead adapted Zyzzyva [23], a state-of-the-art BFT repli- 
cation protocol, to provide high availability. Zyzzyva 
employs speculation to conclude operations fast and 
cheaply, yielding high service throughput during favor- 
able system conditions—while connectivity and repli- 
cas are available—so it is a good candidate to adapt 
for our purposes. Adaptation was challenging for sev- 
eral reasons, such as dealing with the conflict between 
the client’s need for a fast and meaningful response and 
the requirement that each request is brought to comple- 
tion, or adapting the view change protocols to also enable 
progress when only a small fraction of the replicas are 
reachable and to merge the state of individual partitions 
when enough connectivity is re-established. 

The rest of the paper is organized as follows. Section 2 
motivates the need for eventual consistency. Section 3 
defines the properties guaranteed by our protocol. Sec- 
tion 4 describes how Zeno works and Section 5 sketches 
the proof of its correctness. Section 6 evaluates how our 
implementation of Zeno performs. Section 7 presents re- 
lated work, and Section 8 concludes. 


2 The Case for Eventual Consistency 


Various levels and definitions of weak consistency have 
been proposed by different communities [16], so we need 
to justify why our particular choice is adequate. We 
argue that eventual consistency is both necessary for 
the guarantees we are targeting, and sufficient from the 
standpoint of many applications. 

Consider a scenario where a network partition occurs, 
that causes half of the replicas from a given replica group 
to be on one side of the partition and the other half on the 
other side. This is plausible given that replicated sys- 
tems often spread their replicas over multiple data cen- 
ters for increased reliability [15], and that Internet parti- 
tions do occur in practice [6]. In this case, eventual con- 
sistency is necessary to offer high availability to clients 
on both sides of the partition, since it is impossible to 
have both sides of the partition make progress and si- 
multaneously achieve a consistency level that provides 
a total order on the operations (“seen” by all client re- 
quests) [7]. Intuitively, the closest approximation of 
that idealized consistency that could be offered is even- 
tual consistency, where clients on each side of the parti- 
tion agree on an ordering (that only orders their opera- 
tions with respect to each other), and, when enough con- 
nectivity is re-established, the two divergent states can 
be merged, meaning that a total order between the oper- 
ations on both sides can be established, and subsequent 
operations will reflect that order. 

Additionally, we argue that eventual consistency is 
sufficient from the standpoint of the properties required 
by many services and applications that run in data cen- 
ters. This has been clearly stated by the designers of 
many of these services [3, 13, 15, 21,34]. Applications 
that use an eventually consistent service have to be able 
to work with responses that may not include some previ- 
ously executed operations. To give an example of appli- 
cations that use Dynamo, this means that customers may 
not get the most up-to-date sales ranks, or may even see 
some items they deleted reappear in their shopping carts, 
in which case the delete operation may have to be redone. 
However, those events are much preferable to having a 
slow, or unavailable service. 

Beyond data-center applications, many other exam- 
ples of eventually consistent services have been deployed 
in common-use systems, for example, DNS. Saito and 
Shapiro [30] provide a more thorough survey of the 
theme. 


3 Algorithm Properties 


We now informally specify safety and liveness properties 
of a generic eventually consistent BFT service. The for- 
mal definitions appear in a separate technical report due 
to lack of space [31]. 


3.1 Safety 


Informally, our safety properties say that an eventu- 
ally consistent system behaves like a centralized server 
whose service state can be modelled as a multi-set. Each 
element of the multi-set is a history (a totally ordered 
subset of the invoked operations), which captures the in- 
tuitive notion that some operations may have executed 
without being aware of each other, e.g., on different sides 
of a network partition, and are therefore only ordered 
with respect to a subset of the requests that were exe- 
cuted. We also limit the total number of divergent his- 
tories, which in the case of Zeno cannot exceed, at any 
time, $\left\lfloor \frac{N - |failed|}{f + 1 - |failed|} \right\rfloor$, where $|failed|$ is the current number 
of failed servers, N is the total number of servers and f 
is the maximum number of servers that can fail. 

We also specify that certain operations are commit- 
ted. Each history has a prefix of committed operations, 
and the committed prefixes are related by containment. 
Hence, all histories agree on the relative order of their 
committed operations, and the order cannot change in 
the future. Aside from this restriction, histories can be 
merged (corresponding to a partition healing) and can be 
forked, which corresponds to duplicating one of the sets 
in the multi-set. 
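
As a quick sanity check on the limit on divergent histories, the expression can be evaluated for small configurations. The sketch below assumes the bound is the floor of (N - |failed|) / (f + 1 - |failed|) and that 0 <= |failed| <= f; the function name is hypothetical and not part of Zeno.

```python
def max_divergent_histories(n_servers: int, f: int, failed: int) -> int:
    """Upper bound on concurrent histories: floor((N - |failed|) / (f + 1 - |failed|)).

    Assumption: the bound is only meaningful for 0 <= failed <= f, which keeps
    the denominator at least 1.
    """
    if not 0 <= failed <= f:
        raise ValueError("the bound assumes 0 <= |failed| <= f")
    return (n_servers - failed) // (f + 1 - failed)


# N = 3f + 1 with f = 1
print(max_divergent_histories(4, 1, 0))  # 2
print(max_divergent_histories(4, 1, 1))  # 3
```

With no failed servers, each history needs its own f + 1 servers, so four servers can support at most two histories; a failed server can take part in every history, which is why the bound grows as |failed| grows.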

Given this state, clients can execute two types of op- 
erations, weak and strong, as follows. Any operation be- 
gins its execution cycle by being inserted at the end of 
any non-empty subset of the histories. At this and any 
subsequent time, a weak operation may return, with the 
corresponding result reflecting the execution of all the 
operations that precede it. In this case, we say that the 
operation is weakly complete. Strong operations 
must wait until they are committed (as defined above) be- 
fore they can return, with the result computed in the same 
way. We assume that each correct client is well-formed: 
it never issues a new request before its previous (weak or 
strong) request is (weakly or strongly, respectively) com- 
plete. 

The merge operation takes two histories and produces 
a new history, containing all operations in both histo- 
ries and preserving the ordering of committed operations. 
However, the weak operations can appear in arbitrary or- 
dering in the merged histories, preserving the causal or- 
der of operations invoked by the same client. This im- 
plies that weak operations may commit in a different or- 
der than when they were weakly completed. 
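
The merge semantics can be illustrated with a small sketch. It is not Zeno's merge procedure (Section 4.7); it only shows one ordering that satisfies the stated constraints. The `Op` type, the representation of a history as a list plus a committed-prefix length, and the choice to order weak operations by (timestamp, client) are all assumptions made for illustration. Sorting by timestamp preserves each client's own order because a well-formed client issues increasing timestamps.

```python
from typing import List, NamedTuple, Tuple


class Op(NamedTuple):
    client: str
    t: int       # per-client timestamp, increasing for a well-formed client
    name: str


def merge(h1: List[Op], k1: int, h2: List[Op], k2: int) -> Tuple[List[Op], int]:
    """Merge two histories; k1 and k2 are the lengths of their committed prefixes."""
    p1, p2 = h1[:k1], h2[:k2]
    short, long_ = (p1, p2) if len(p1) <= len(p2) else (p2, p1)
    # Committed prefixes must be related by containment.
    if long_[:len(short)] != short:
        raise ValueError("committed prefixes are not related by containment")
    committed = list(long_)
    seen = set(committed)
    # Weak (uncommitted) operations from both sides, without duplicates.
    weak = {op for op in h1[k1:] + h2[k2:] if op not in seen}
    # Any order is allowed for weak operations as long as each client's own
    # order is kept; sorting by (t, client) is one deterministic choice.
    ordered_weak = sorted(weak, key=lambda op: (op.t, op.client))
    return committed + ordered_weak, len(committed)


a1 = Op("alice", 1, "put")
b1 = Op("bob", 1, "get")
c1 = Op("carol", 1, "put")
c2 = Op("carol", 2, "del")
merged, committed_len = merge([a1, b1], 1, [a1, c1, c2], 1)
print([op.name for op in merged], committed_len)
```

Note that `b1` and `c1` may end up in a different relative order than either partition observed, which is the effect described above: weak operations may commit in a different order than the one in which they weakly completed.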


3.2 Liveness 


On the liveness side, our service guarantees that a request 
issued by a correct client is processed and a response is 
returned to the client, provided that the client can com- 
municate with enough replicas in a timely manner. 

More precisely, we assume a default round-trip delay 
$\Delta$ and we say that a set of servers $\Pi' \subseteq \Pi$ is eventually 
synchronous if there is a time after which every two-way 
message exchange within $\Pi'$ takes at most $\Delta$ time units. 
We also assume that every two correct servers or clients 
can eventually reliably communicate. Now our progress 
requirements can be put as follows: 


(L1) If there exists an eventually synchronous set of f + 1 
correct servers $\Pi'$, then every weak request issued 
by a correct client is eventually weakly complete. 


(L2) If there exists an eventually synchronous set of 2f + 
1 correct servers $\Pi'$, then every weakly complete 
request or a strong request issued by a correct client 
is eventually committed. 


In particular, (L1) and (L2) imply that if there is 
an eventually synchronous set of 2f + 1 correct replicas, 
then each (weak or strong) request issued by a correct 
client will eventually be committed. 
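
Read as thresholds on the size of an eventually synchronous set of correct servers, the two liveness conditions can be summarized in a few lines. This is only a restatement of (L1) and (L2); the function name and the string labels are hypothetical.

```python
def guaranteed_progress(sync_correct: int, f: int) -> str:
    """Strongest progress guarantee for an eventually synchronous set of
    `sync_correct` correct servers, following (L1) and (L2)."""
    if sync_correct >= 2 * f + 1:
        return "commit"         # (L2): weak and strong requests eventually commit
    if sync_correct >= f + 1:
        return "weak-complete"  # (L1): weak requests eventually weakly complete
    return "none"               # neither condition applies


for size in range(1, 5):
    print(size, guaranteed_progress(size, f=1))
```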


As we will explain later, ensuring (L1) in the pres- 
ence of partitions may require unbounded storage. We 
will present a protocol addition that bounds the storage 
requirements at the expense of relaxing (L1). 


4 Zeno Protocol 


4.1 System model 


Zeno is a BFT state machine replication protocol. It 
requires N = (3f +1) replicas to tolerate f Byzantine 
faults, i.e., we make no assumption about the behavior 
of faulty replicas. Zeno also tolerates an arbitrary num- 
ber of Byzantine clients. We assume no node can break 
cryptographic techniques like collision-resistant digests, 
encryption, and signing. The protocol we present in this 
paper uses public key digital signatures to authenticate 
communication. In a separate technical report [31], we 
present a modified version of the protocol that uses more 
efficient symmetric cryptography based on message au- 
thentication codes (MACs). 

The protocol uses two kinds of quorums: strong quo- 
rums consisting of any group of 2f + 1 distinct replicas, 
and weak quorums of f + 1 distinct replicas. 

The system easily generalizes to any N > 3f +1, 
in which case the size of strong quorums becomes 
$\lceil \frac{N + f + 1}{2} \rceil$, and weak quorums remain the same, indepen- 
dent of N. Note that one can apply our techniques in 
very large replica groups (where N >> 3f + 1) and still 
make progress as long as f + 1 replicas are available, 
whereas traditional (strongly consistent) BFT systems 
can be blocked unless at least $\lceil \frac{N + f + 1}{2} \rceil$ replicas, grow- 
ing with N, are available. 
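
The quorum sizes can be computed directly. The sketch below assumes strong quorums of size ceil((N + f + 1) / 2), which reduces to 2f + 1 when N = 3f + 1, and weak quorums of size f + 1; the function names are hypothetical.

```python
def weak_quorum(f: int) -> int:
    """Weak quorum size; independent of N."""
    return f + 1


def strong_quorum(n: int, f: int) -> int:
    """Strong quorum size, assumed to be ceil((N + f + 1) / 2)."""
    if n < 3 * f + 1:
        raise ValueError("at least 3f + 1 replicas are required")
    # Integer form of ceil(x / 2) is (x + 1) // 2.
    return (n + f + 2) // 2


for n in (4, 10, 100):
    print(n, weak_quorum(1), strong_quorum(n, 1))
```

For f = 1 the weak quorum stays at 2 replicas whether N is 4 or 100, while the strong quorum grows from 3 to 51, which is the availability gap the paragraph above describes.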


4.2 Overview 


Like most traditional BFT state machine replication pro- 
tocols, Zeno has three components: sequence number as- 
signment (Section 4.4) to determine the total order of op- 
erations, view changes (Section 4.5) to deal with leader 
replica election, and checkpointing (Section 4.8) to deal 
with garbage collection of protocol and application state. 

The execution goes through a sequence of configu- 
rations called views. In each view, a designated leader 
replica (the primary) is responsible for assigning mono- 
tonically increasing sequence numbers to clients’ opera- 
tions. A replica j is the primary for the view numbered v 
iff j = v mod N. 

At a high level, normal case execution of a request 
proceeds as follows. A client first sends its request to 
all replicas. A designated primary replica assigns a se- 
quence number to the client request and broadcasts this 
proposal to the remaining replicas. Then all replicas ex- 
ecute the request and return a reply to the client. 


o | operation to be performed 
s | flag indicating if this is a strong operation 
r | result of the operation 
D(·) | cryptographic digest function 





Table 1: Notations used in message fields. 


Once the client gathers sufficiently many matching 
replies—replies that agree on the operation result, the 
sequence number, the view, and the replica history—it 
returns this result to the application. For weak requests, 
it suffices that a single correct replica returned the re- 
sult, since that replica will not only provide a correct 
weak reply by properly executing the request, but it will 
also eventually commit that request to the linear history 
of the service. Therefore, the client need only collect 
matching replies from a weak quorum of replicas. For 
strong requests, the client must wait for matching replies 
from a strong quorum, that is, a group of at least 2f+ 1 
distinct replicas. This implies that Zeno can complete 
many weak operations in parallel across different parti- 
tions when only weak quorums are available, whereas 
it can complete strong operations only when there are 
strong quorums available. 

Whenever operations do not make progress, or if repli- 
cas agree that the primary is faulty, a view change pro- 
tocol tries to elect a new primary. Unlike in previous 
BFT protocols, view changes in Zeno can proceed with 
the agreement of only a weak quorum. This can allow 
multiple primaries to coexist in the system (e.g., during 
a network partition) which is necessary to make progress 
with eventual consistency. However, as soon as these 
multiple views (with possibly divergent sets of opera- 
tions) detect each other (Section 4.6), they reconcile their 
operations via a merge procedure (Section 4.7), restoring 
consistency among replicas. 

In what follows, messages with a subscript of the form 
$\sigma_c$ denote a public-key signature by principal c. In all 
protocol actions, malformed or improperly signed mes- 
sages are dropped without further processing. We inter- 
changeably use terms “non-faulty” and “correct” to mean 
system components (e.g., replicas and clients) that follow 
our protocol faithfully. Table 1 collects our notation. 

We start by explaining the protocol state at the repli- 
cas. Then we present details about the three protocol 
components. We used Zyzzyva [23] as a starting point 
for designing Zeno. Therefore, throughout the presenta- 
tion, we will explain how Zeno differs from Zyzzyva. 


4.3 Protocol State 


Each replica i maintains the highest sequence number 
n it has executed, the number v of the view it is cur- 
rently participating in, and an ordered history of requests 
it has executed along with the ordering received from 
the primary. Replicas maintain a hash-chain digest $h_n$ 
of the n operations in their history in the following way: 
$h_{n+1} = D(h_n, D(REQ_{n+1}))$, where D is a cryptographic 
digest function and $REQ_{n+1}$ is the request assigned se- 
quence number n + 1. 
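
A minimal sketch of the hash chain, h_{n+1} = D(h_n, D(REQ_{n+1})), is shown below. The choice of SHA-256 for D, the length-prefixed encoding of the two arguments, and the empty initial digest h_0 are assumptions made for illustration; the text does not specify them.

```python
import hashlib
from typing import Iterable


def D(*parts: bytes) -> bytes:
    """Digest stand-in (SHA-256 assumed). Each part is length-prefixed so that
    D(a, b) cannot collide with D(a', b') merely because a + b == a' + b'."""
    h = hashlib.sha256()
    for p in parts:
        h.update(len(p).to_bytes(4, "big"))
        h.update(p)
    return h.digest()


def extend(h_n: bytes, request: bytes) -> bytes:
    """One chain step: h_{n+1} = D(h_n, D(REQ_{n+1}))."""
    return D(h_n, D(request))


def history_digest(requests: Iterable[bytes], h0: bytes = b"") -> bytes:
    """Digest of an ordered history, starting from an assumed empty h_0."""
    h = h0
    for req in requests:
        h = extend(h, req)
    return h


same = history_digest([b"put x", b"get x"]) == history_digest([b"put x", b"get x"])
swapped = history_digest([b"put x", b"get x"]) == history_digest([b"get x", b"put x"])
print(same, swapped)  # True False
```

Because each digest covers the previous one, two replicas whose digests match at sequence number n agree on the whole ordered history up to n, which is what lets a single digest in a reply stand for the replica history.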

A prefix of the ordered history up to sequence number 
$\ell$ is called committed when a replica gathers a commit 
certificate (denoted CC and described in detail in Sec- 
tion 4.4) for $\ell$; each replica only remembers the highest 
CC it witnessed. 

To prevent the history of requests from growing with- 
out bounds, replicas assemble checkpoints after every 
CHKP_INTERVAL sequence numbers. For every check- 
point sequence number $\ell$, a replica first obtains the CC 
for $\ell$ and executes all operations up to and including $\ell$. At 
this point, a replica takes a snapshot of the application 
state and stores it (Section 4.8). 

Replicas remember the set of operations received from 
each client c in their request[c] buffer and only the last 
reply sent to each client in their reply[c] buffer. The re- 
quest buffer is flushed when a checkpoint is taken. 


4.4 Sequence Number Assignment 


To describe how sequence number assignment works, we 
follow the flow of a request. 


Client sends request. A correct client c sends a request 
$\langle REQUEST, o, t, c, s \rangle_{\sigma_c}$ to all replicas, where o is the op- 
eration, t is a sequence number incremented on every re- 
quest, and s is the strong operation flag. 


Primary assigns sequence number and broadcasts or- 
der request (OR) message. If the last operation ex- 
ecuted for this client has timestamp t' = t - 1, then 
primary i assigns the next available sequence number 
n + 1 to this request, increments n, and then broadcasts 
an $\langle OR, v, n, h_n, D(REQ), i, s, ND \rangle_{\sigma_i}$ message to backup 
replicas. ND is a set of non-deterministic application 
variables, such as a seed for a pseudorandom num- 
ber generator, used by the application to generate non- 
determinism. 


Replicas receive OR. When a replica j receives an 
OR message and the corresponding client request, it first 
checks if both are authentic, and then checks if it is in 
view v. If valid, it calculates $h'_{n+1} = D(h_n, D(REQ))$ and 
checks if $h'_{n+1}$ is equal to the history digest in the OR 
message. Next, it increments its highest sequence num- 
ber n, and executes the operation o from REQ on the ap- 
plication state and obtains a reply r. A replica sends the 
reply $\langle \langle SPECREPLY, v, n, h_n, D(r), c, t \rangle_{\sigma_j}, j, r, OR \rangle$ im- 
mediately to the client if s is false (i.e., this is a weak 
request). If s is true, then the request must be com- 
mitted before replying, so a replica first multicasts a 
$\langle COMMIT, OR, j \rangle_{\sigma_j}$ to all others. When a replica re- 
ceives at least 2f + 1 such COMMIT messages (in- 
cluding its own) matching in n, v, $h_n$, D(REQ), it 
forms a commit certificate CC consisting of the set of 
COMMIT messages and the corresponding OR, stores 
the CC, and sends the reply to the client in a message 
$\langle \langle REPLY, v, n, h_n, D(r), c, t \rangle_{\sigma_j}, j, r, OR \rangle$. The primary fol- 
lows the same logic to execute the request, potentially 
committing it, and sending the reply to the client. Note 
that the commit protocol used for strong requests will 
also add all the preceding weak requests to the set of 
committed operations. 
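
The replica-side checks described above can be sketched as follows. This is a simplified illustration, not the prototype's code: signature verification, the ND field, and actual execution of the operation are omitted, SHA-256 stands in for D, and all class, field, and return-value names are hypothetical.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Dict, List, Set


def D(*parts: bytes) -> bytes:
    """Digest stand-in (SHA-256 assumed), with length-prefixed inputs."""
    h = hashlib.sha256()
    for p in parts:
        h.update(len(p).to_bytes(4, "big") + p)
    return h.digest()


@dataclass(frozen=True)
class OrderRequest:
    v: int             # view number
    n: int             # sequence number assigned by the primary
    h_n: bytes         # history digest including this request
    req_digest: bytes  # D(REQ)
    s: bool            # strong-operation flag


@dataclass
class Replica:
    f: int
    v: int = 0
    n: int = 0
    h: bytes = b""
    history: List[bytes] = field(default_factory=list)
    commits: Dict[int, Set[int]] = field(default_factory=dict)

    def on_order_request(self, msg: OrderRequest, req: bytes) -> str:
        # Must be in view v and the request must be the next in sequence.
        if msg.v != self.v or msg.n != self.n + 1:
            return "ignore"
        expected = D(self.h, D(req))
        # The OR must name this request and carry the matching history digest.
        if msg.req_digest != D(req) or msg.h_n != expected:
            return "ignore"
        self.n, self.h = msg.n, expected
        self.history.append(req)  # stands in for executing operation o
        return "multicast-commit" if msg.s else "spec-reply"

    def on_commit(self, n: int, sender: int) -> bool:
        """Record a COMMIT for n; True once 2f + 1 distinct replicas sent one."""
        self.commits.setdefault(n, set()).add(sender)
        return len(self.commits[n]) >= 2 * self.f + 1


req = b"put x=1"
replica = Replica(f=1)
weak_or = OrderRequest(v=0, n=1, h_n=D(b"", D(req)), req_digest=D(req), s=False)
print(replica.on_order_request(weak_or, req))  # spec-reply
```

A weak request is answered as soon as the checks pass, while a strong request only leads to a reply once `on_commit` reports that a commit certificate can be formed.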


Client receives responses. For weak requests, if a 
client receives a weak quorum of SPECREPLY messages 
matching in their v, n, h, r, and OR, it considers the re- 
quest weakly complete and returns a weak result to the 
application. For strong requests, a client requires match- 
ing REPLY messages from a strong quorum to consider 
the operation complete. 
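
On the client side, completion amounts to counting distinct replicas whose replies agree. The sketch below assumes replies have already been authenticated and are represented as tuples (replica id, v, n, h, r); that representation and the function name are hypothetical.

```python
from typing import Dict, Iterable, Optional, Set, Tuple

Reply = Tuple[int, int, int, bytes, str]  # (replica id, v, n, h, r)


def matching_result(replies: Iterable[Reply], quorum: int) -> Optional[str]:
    """Return the result r once `quorum` distinct replicas sent replies that
    match in v, n, h and r; otherwise return None."""
    votes: Dict[tuple, Set[int]] = {}
    for replica_id, *key in replies:
        votes.setdefault(tuple(key), set()).add(replica_id)
    for key, senders in votes.items():
        if len(senders) >= quorum:
            return key[-1]
    return None


f = 1
replies = [
    (0, 0, 1, b"h1", "ok"),
    (1, 0, 1, b"h1", "ok"),
    (2, 0, 1, b"hX", "bogus"),  # a faulty replica's reply does not match
]
print(matching_result(replies, quorum=f + 1))      # weak quorum is reached
print(matching_result(replies, quorum=2 * f + 1))  # strong quorum is not
```

The same routine serves both request types: a weak request passes `f + 1` and SPECREPLY messages, a strong request passes the strong quorum size and REPLY messages.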


Fill Hole Protocol. Replicas only execute requests— 
both weak and strong—in sequence number order. How- 
ever, due to message loss or other network disrup- 
tions, a replica i may receive an OR or a COMMIT 
message with a higher-than-expected sequence num- 
ber (that is, OR.n > n + 1); the replica discards such 
messages, asking the primary to “fill it in” on what 
it has missed (the OR messages with sequence num- 
bers between n + 1 and OR.n) by sending the primary 
a $\langle FILLHOLE, v, n, OR.n, i \rangle$ message. Upon receipt, the 
primary resends all of the requested OR messages back 
to i, to bring it up-to-date. 
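
The fill-hole exchange can be sketched in two small functions. The sketch assumes the requested range is inclusive of OR.n (the replica discarded that message, so it needs it again) and that the primary keeps its OR messages in a log keyed by sequence number; both are assumptions, as are the function names.

```python
from typing import Dict, List


def missing_range(n: int, or_n: int) -> range:
    """Sequence numbers a replica at n is missing after seeing OR.n > n + 1."""
    if or_n <= n + 1:
        return range(0)  # nothing is missing; the message is in order or old
    return range(n + 1, or_n + 1)


def handle_fill_hole(log: Dict[int, str], n: int, or_n: int) -> List[str]:
    """Primary side: resend the logged OR messages the replica asked for."""
    return [log[k] for k in missing_range(n, or_n) if k in log]


primary_log = {k: f"OR#{k}" for k in range(1, 8)}
print(handle_fill_hole(primary_log, n=3, or_n=7))
```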


Comparison to Zyzzyva. There are four important 
differences between Zeno and Zyzzyva in the normal ex- 
ecution of the protocol. 

First, Zeno clients only need matching replies from a 
weak quorum, whereas Zyzzyva requires at least a strong 
quorum; this leads to a significant increase in availability, 
when for example only between f + 1 and 2f replicas are 
available. It also allows for slightly lower overhead at the 
client due to reduced message processing requirements, 
and to a lower latency for request execution when inter- 
node latencies are heterogeneous. 

Second, Zeno requires clients to use sequential times- 
tamps instead of monotonically increasing but not nec- 
essarily sequential timestamps (which are the norm in 
comparable systems). This is required for garbage col- 
lection (Section 4.8). This raises the issue of how to deal 


with clients that reboot or otherwise lose the informa- 
tion about the latest sequence number. In our current im- 
plementation we are not storing this sequence number 
persistently before sending the request. We chose this 
because the guarantees we obtain are still quite strong: 
the requests that were already committed will remain in 
the system, this does not interfere with requests from 
other clients, and all that might happen is the client los- 
ing some of its initial requests after rebooting or old- 
est uncommitted requests. As future work, we will de- 
vise protocols for improving these guarantees further, or 
for storing sequence numbers efficiently using SSDs or 
NVRAM. 

Third, whereas Zyzzyva offers a single-phase perfor- 
mance optimization, in which a request commits in only 
three message steps under some conditions (when all 
3f + 1 replicas operate roughly synchronously and are all 
available and non-faulty), Zeno disables that optimiza- 
tion. The rationale behind this removal is based on the 
view change protocol (Section 4.5) so we defer the dis- 
cussion until then. A positive side-effect of this removal 
is that, unlike with Zyzzyva, Zeno does not entrust po- 
tentially faulty clients with any protocol step other than 
sending requests and collecting responses. 

Finally, clients in Zeno send the request to all replicas 
whereas clients in Zyzzyva send the request only to the 
primary replica. This change is required only in the MAC 
version of the protocol but we present it here to keep 
the protocol description consistent. At a high level, this 
change is required to ensure that a faulty primary can- 
not prevent a correct request that has weakly completed 
from committing—the faulty primary may manipulate a 
few of the MACs in an authenticator present in the re- 
quest before forwarding it to others, so that during the commit 
phase too few correct replicas can verify the 
authenticator and the request is dropped. Interestingly, we find 
that the implementations of both PBFT and Zyzzyva pro- 
tocols also require the clients to send the request directly 
to all replicas. 

Our protocol description omits some of the pedantic 
details such as handling faulty clients or request retrans- 
missions; these cases are handled similarly to Zyzzyva 
and do not affect the overheads or benefits of Zeno when 
compared to Zyzzyva. 


4.5 View Changes 


We now turn to the election of a new primary when the 
current primary is unavailable or faulty. The key point 
behind our view change protocol is that it must be able 
to proceed when only a weak quorum of replicas is avail- 
able unlike view change algorithms in strongly consistent 
BFT systems which require availability of a strong quo- 
rum to make progress. The reason for this is the follow- 
ing: strongly consistent BFT systems rely on the quorum 
intersection property to ensure that if a strong quorum Q
decides to change view and another strong quorum Q' decides
to commit a request, there is at least one non-faulty
replica in both quorums, ensuring that view changes do
not “lose” requests committed previously. This implies
that the sizes of strong quorums are at least 2f + 1, so
that the intersection of any two contains at least f + 1
replicas, including (since no more than f of those can
be faulty) at least one non-faulty replica. In contrast,
Zeno does not require view change quorums to intersect; 
a weak request missing from a view change will be even- 
tually committed when the correct replica executing it 
manages to reach a strong quorum of correct replicas, 
whereas strong requests missing from a view change will 
cause a subsequent provable divergence and application-state
merge.
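
The quorum arithmetic above can be checked with a short sketch. It assumes N = 3f + 1 replicas, strong quorums of 2f + 1, and weak quorums of f + 1; the helper name `min_overlap` is ours and is not part of Zeno.

```python
def min_overlap(n: int, size_a: int, size_b: int) -> int:
    """Smallest possible intersection of two subsets of n replicas."""
    return max(0, size_a + size_b - n)


f = 1
n = 3 * f + 1                    # total replicas
strong, weak = 2 * f + 1, f + 1  # quorum sizes

# Two strong quorums share at least f + 1 replicas, so at least one
# replica in the intersection is non-faulty.
strong_strong = min_overlap(n, strong, strong)

# Two weak quorums may be disjoint, which is why Zeno cannot rely on
# quorum intersection during weak view changes.
weak_weak = min_overlap(n, weak, weak)
```

With f = 1 this gives an overlap of 2 for two strong quorums and 0 for two weak quorums.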


View Change Protocol. A client c retransmits the re- 
quest to all replicas if it times out before completing its 
request. A replica receiving a client retransmission first 
checks if the request is already executed; if so, it simply 
resends the SPECREPLY/REPLY to the client from its re- 
ply[c] buffer. Otherwise, the replica forwards the request 
to the primary and starts an IHateThePrimary timer.

In the latter case, if the replica does not receive
an OR message before it times out, it broadcasts
$\langle \mathrm{IHATETHEPRIMARY}, v \rangle_{\sigma_i}$ to all replicas, but continues
to participate in the current view. If a replica
receives such accusations from a weak quorum, it
stops participating in the current view v and sends a
$\langle \mathrm{VIEWCHANGE}, v+1, CC, O \rangle_{\sigma_i}$ to other replicas, where
CC is the highest commit certificate, and O is i's ordered
request history since that commit certificate, i.e.,
all OR messages for requests with sequence numbers
higher than the one in CC. It then starts the view change
timer.

The primary replica j for view v + 1 starts a timer with
a shorter timeout value called the aggregation timer and
waits until it collects a set of VIEWCHANGE messages
for view v + 1 from a strong quorum, or until its aggregation
timer expires. If the aggregation timer expires and
the primary replica has collected f + 1 or more such messages,
it sends a $\langle \mathrm{NEWVIEW}, v+1, P \rangle_{\sigma_j}$ to other replicas,
where P is the set of VIEWCHANGE messages it
gathered (we call this a weak view change, as opposed to
one where a strong quorum of replicas participate, which
is called a strong view change). If a replica does not
receive the NEWVIEW message before the view change
timer expires, it starts a view change into the next view
number.
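
The new primary's choice between a strong and a weak view change can be sketched as follows. This is a minimal illustration under our own naming (`new_view_kind` is hypothetical); it only encodes the thresholds stated above and ignores message validation.

```python
from typing import Optional


def new_view_kind(num_view_changes: int, f: int,
                  aggregation_timer_expired: bool) -> Optional[str]:
    """Decide which NEWVIEW, if any, the new primary may send."""
    strong_quorum = 2 * f + 1
    weak_quorum = f + 1
    if num_view_changes >= strong_quorum:
        return "strong"   # enough messages, no need to wait for the timer
    if aggregation_timer_expired and num_view_changes >= weak_quorum:
        return "weak"     # timer fired with only a weak quorum
    return None           # keep waiting for VIEWCHANGE messages
```

For f = 1, two VIEWCHANGE messages yield a weak view change only after the aggregation timer has expired, while three yield a strong view change immediately.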

Note that waiting for messages from a strong quorum 
is not needed to meet our eventual consistency specifi- 
cation, but helps to avoid a situation where some opera- 
tions are not immediately incorporated into the new view, 
which would later create a divergence that would need to
be resolved using our merge procedure. Thus it improves 
the availability of our protocol. 

Each replica locally calculates the initial state for the
new view by executing the requests contained in P,
thereby updating both n and the history chain digest $h_n$.
The order in which these requests are executed and how
the initial state for the new view is calculated is related
to how we merge divergent states from different replicas,
so we defer this explanation to Section 4.7. Each replica
then sends a $\langle \mathrm{VIEWCONFIRM}, v+1, n, h_n, i \rangle_{\sigma_i}$ to all others,
and once it receives such VIEWCONFIRM messages
matching in v + 1, n, and $h_n$ from a weak or a strong quorum
(for weak or strong view changes, respectively) the
replica becomes active in view v + 1 and stops processing
messages for any prior views.

The view change protocol allows a set of f + 1 cor- 
rect but slow replicas to initiate a global view change 
even if there is a set of f + 1 synchronized correct repli- 
cas, which may affect our liveness guarantees (in par- 
ticular, the ability to eventually execute weak requests 
when there is a synchronous set of f + 1 correct servers). 
We avoid this by prioritizing client requests over view 
change requests as follows. Every replica maintains a
set of client requests that it has received but that have not
yet been processed (put in an ordered request) by the primary.
Whenever a replica i receives a message from j related
to the view change protocol (IHATETHEPRIMARY,
VIEWCHANGE, NEWVIEW, or VIEWCONFIRM) for a
higher view, i first forwards the outstanding requests to
the current primary and waits until the corresponding
ORs are received or a timer expires. For each pending
request, if a valid OR is received, then the replica sends the
corresponding response back to the client. Then i processes
the original view change related messages from j
according to the protocol described above. This guarantees
that the system makes progress even in the presence
of continuous view changes caused by the slow replicas 
in such pathological situations. 


Comparison to Zyzzyva. View changes in Zeno differ 
from Zyzzyva in the size of the quorum required for a 
view change to succeed: we require f + 1 view change 
messages before a new view can be announced, whereas 
previous protocols required 2f + 1 messages. Moreover, 
the way a new view message is processed is also dif- 
ferent in Zeno. Specifically, the start state in a new 
view must incorporate not only the highest CC in the 
VIEWCHANGE messages, but also all ORDERREQ that 
appear in any VIEWCHANGE message from the previ- 
ous view. This guarantees that a request is incorporated 
within the state of a new view even if only a single replica 
reports it; in contrast, Zyzzyva and other similar protocols
require support from a weak quorum for every request
moved forward through a view change. This is required
in Zeno since it is possible that only one replica
supports an operation that was executed in a weak view 
and no other non-faulty replica has seen that operation, 
and because bringing such operations to a higher view is 
needed to ensure that weak requests are eventually com- 
mitted. 


The following sections describe additions to the view 
change protocols to incorporate functionality for detect- 
ing and merging concurrent histories, which are also ex- 
clusive to Zeno. 


4.6 Detecting Concurrent Histories 


Concurrent histories (i.e., divergence in the service state)
can be formed for several reasons. This can occur when 
the view change logic leads to the presence of two repli- 
cas that simultaneously believe they are the primary, and 
there are a sufficient number of other replicas that also 
share that belief and complete weak operations proposed 
by each primary. This could be the case during a network 
partition that splits the set of replicas into two subsets, 
each of them containing at least f + 1 replicas. 


Another possible reason for concurrent histories is that 
the base history decided during a view change may not 
have the latest committed operations from prior views. 
This is because a view change quorum (a weak quorum) 
may not share a non-faulty replica with prior commit- 
ment quorums (strong quorums) and remaining replicas; 
as a result, some committed operations may not appear in
VIEWCHANGE messages and, therefore, may be missing 
from the new starting state in the NEWVIEW message. 


Finally, a misbehaving primary can also cause diver- 
gence by proposing the same sequence numbers to dif- 
ferent operations, and forwarding the different choices 
to disjoint sets of replicas. 


Basic Idea. Two request history orderings $h^i_1, h^i_2, \ldots$
and $h^j_1, h^j_2, \ldots$, present at replicas i and j respectively,
are called concurrent if there exists a sequence number
n such that $h^i_n \neq h^j_n$; because of the collision resistance
of the hash chaining mechanism used to produce
history digests, this means that the sequence of requests 
represented by the two digests differ as well. A replica 
compares history digests whenever it receives protocol 
messages such as OR, COMMIT, or CHECKPOINT (de- 
scribed in Section 4.8) that purport to share the same his- 
tory as its own. 
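
The following sketch illustrates why comparing chained digests suffices to detect concurrent histories. It is an illustration only: SHA-256, the chaining rule h_n = H(h_{n-1} || H(REQ_n)), the empty initial digest, and sequence numbers starting at 1 are our assumptions, and the function names are hypothetical.

```python
import hashlib
from typing import List, Optional


def history_digests(requests: List[bytes]) -> List[bytes]:
    """Return the chained digest after each request (assumed chaining rule)."""
    digests, h = [], b""
    for req in requests:
        h = hashlib.sha256(h + hashlib.sha256(req).digest()).digest()
        digests.append(h)
    return digests


def first_divergence(a: List[bytes], b: List[bytes]) -> Optional[int]:
    """Lowest sequence number (1-based) at which two digest chains differ."""
    for n, (ha, hb) in enumerate(zip(a, b), start=1):
        if ha != hb:
            return n
    return None  # no difference on the common range


hist_i = history_digests([b"put x", b"put y", b"get x"])
hist_j = history_digests([b"put x", b"put z", b"get x"])
diverged_at = first_divergence(hist_i, hist_j)
```

Because each digest covers the entire prefix, the chains still differ at sequence number 3 even though the third request is the same at both replicas.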


For clarity, we first describe how we detect divergence 
within a view and then discuss detection across views. 
We also defer details pertaining to garbage collection of 
replica state until Section 4.8. 


4.6.1 Divergence between replicas in same view 


Suppose replica i is in view $v_i$, has executed up to
sequence number $n_i$, and receives a properly authenticated
message $\langle \mathrm{OR}, v_i, n_j, h_{n_j}, D(REQ), p, s, ND \rangle_{\sigma_p}$
or $\langle \mathrm{COMMIT}, \langle \mathrm{OR}, v_i, n_j, h_{n_j}, D(REQ), p, s, ND \rangle_{\sigma_p}, j \rangle_{\sigma_j}$
from replica j.

If $n_i < n_j$, i.e., j has executed a request with
sequence number $n_j$, then the fill-hole mechanism
is started, and i receives from j a message
$\langle \mathrm{OR}, v', n_i, h_{n_i}, D(REQ'), k, s, ND \rangle_{\sigma_k}$, where $v' \leq v_i$ and
k = primary(v').

Otherwise, if $n_i \geq n_j$, both replicas have executed a
request with sequence number $n_j$ and therefore i must
have some $\langle \mathrm{OR}, v', n_j, h_{n_j}, D(REQ'), k, s, ND \rangle_{\sigma_k}$ message
in its log, where $v' \leq v_i$ and k = primary(v').

If the two history digests match (the local $h_{n_i}$ or $h_{n_j}$,
depending on whether $n_i \geq n_j$, and the one received in
the message), then the two histories are consistent and
no concurrency is deduced.

If instead the two history digests differ, the histories 
must differ as well. If the two OR messages are authen- 
ticated by the same primary, together they constitute a 
proof of misbehavior (POM); through an inductive argu- 
ment it can be shown that the primary must have assigned 
different requests to the same sequence number n;. Such 
a POM is sufficient to initiate a view change and a merge 
of histories (Section 4.7). 

The case when the two OR messages are authenticated 
by different primaries indicates the existence of diver- 
gence, caused for instance by a network partition, and 
we discuss how to handle it next. 
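
The outcome of the same-view comparison can be summarized in a small sketch. The function and its string labels are hypothetical; it assumes both OR messages have already been authenticated and refer to the same sequence number.

```python
def classify_comparison(local_digest: bytes, local_primary: int,
                        received_digest: bytes, received_primary: int) -> str:
    """Classify two authenticated OR messages for the same sequence number."""
    if local_digest == received_digest:
        return "consistent"   # histories agree, nothing to do
    if local_primary == received_primary:
        return "POM"          # one primary signed two different histories
    return "divergence"       # different primaries, e.g. after a partition
```

A "POM" result is enough to start a view change and merge, while "divergence" is handled by the cross-view procedure.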


4.6.2 Divergence across views 


Now assume that replica i receives a message from
replica j indicating that $v_j > v_i$. This could happen due to
a partition, during which different subsets changed views
independently, or due to other network and replica asynchrony.
Replica i requests the NEWVIEW message for
$v_j$ from j. (The case where $v_j < v_i$ is similar, with the
exception that i pushes the NEWVIEW message to j instead.)

When node i receives and verifies the
$\langle \mathrm{NEWVIEW}, v_j, P \rangle_{\sigma_p}$ message, where p is the issuing
primary of view $v_j$, it compares its local history to
the sequence of OR messages obtained after ordering
the OR messages present in the NEWVIEW message
(according to the procedure described in Section 4.7).
Let $n_l$ and $n_h$ be the lowest and highest sequence
numbers of those OR messages, respectively.


Case 1: [$n_i < n_l$] Replica i is missing future requests,
so it sends j a FILLHOLE message requesting the OR
messages between $n_i$ and $n_l$. When these are received, it
compares the OR message for $n_i$ to detect if there was divergence.
If so, the replica has obtained a proof of divergence
(POD), consisting of the two OR messages, which it can
use to initiate a new view change. If not, it executes the
operations from $n_i$ to $n_l$ and ensures that its history after
executing $n_l$ is consistent with the CC present in the
NEWVIEW message, and then handles the NEWVIEW
message normally and enters $v_j$. If the histories do not
match, this also constitutes a POD.


Case 2: [$n_l \leq n_i \leq n_h$] Replica i must have the corresponding
ORDERREQ for all requests with sequence
numbers between $n_l$ and $n_i$ and can therefore check if
its history diverges from that which was used to generate
the new view. If it finds no divergence, it moves to
$v_j$ and calculates the start state based on the NEWVIEW
message (Section 4.5). Otherwise, it generates a POD
and initiates a merge.


Case 3: [$n_i > n_h$] Replica i has corresponding OR
messages for all sequence numbers appearing in the
NEWVIEW and can check for divergence. If no divergence
is found, the replica has executed more requests in
a lower view $v_i$ than $v_j$. Therefore, it generates a Proof
of Absence (POA), consisting of all OR messages with
sequence numbers in $[n_l, n_i]$ and the NEWVIEW message
for the higher view, and initiates a merge. If divergence
is found, i generates a POD and also initiates a merge.

Like traditional view change protocols, a replica i does
not enter $v_j$ if the NEWVIEW message for that view did
not include all of i's committed requests. This is important
for the safety properties providing guarantees for
strong operations, since it excludes a situation where requests
could be committed in $v_j$ without seeing previously
committed requests.
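
The three-way case split can be sketched as a small classifier. Here `n_i` is the highest sequence number the receiving replica has executed, and `n_lo` and `n_hi` are our names for the lowest and highest sequence numbers of the OR messages in the NEWVIEW message; the function is illustrative only.

```python
def new_view_case(n_i: int, n_lo: int, n_hi: int) -> int:
    """Return which of the three cases applies to the receiving replica."""
    if n_lo > n_hi:
        raise ValueError("expected n_lo <= n_hi")
    if n_i < n_lo:
        return 1   # replica is behind: fill holes, then compare
    if n_i <= n_hi:
        return 2   # replica is inside the range: compare directly
    return 3       # replica is ahead: POA if consistent, POD otherwise
```

For example, with OR messages covering sequence numbers 10 through 20, a replica at 5 falls under Case 1, one at 15 under Case 2, and one at 25 under Case 3.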


4.7 Merging Concurrent Histories 


Once concurrent histories are detected, we need to merge 
them in a deterministic order. The solution we propose 
is to extend the view change protocol, since many of the 
functionalities required for merging are similar to those 
required to transfer a set of operations across views. 

We extend the view change mechanism so that view
changes can be triggered by PODs, POMs, or
POAs. When a replica obtains a POM, a POD, or a POA
after detecting divergence, it multicasts a message of the
form $\langle \mathrm{POMMSG}, v, POM \rangle_{\sigma_i}$, $\langle \mathrm{PODMSG}, v, POD \rangle_{\sigma_i}$, or
$\langle \mathrm{POAMSG}, v, POA \rangle_{\sigma_i}$ in addition to the VIEWCHANGE
message for v. Note here that v in POM and POD is
one higher than the highest view number present in the
conflicting ORDERREQ messages, or one higher than the
view number in the NEWVIEW component in the case of
a POA.

Upon receiving an authentic and valid POMMSG,
PODMSG, or POAMSG, a replica broadcasts a
VIEWCHANGE along with the triggering POM, POD, or
POA message.

The view change mechanism will eventually lead to 
the election of a new primary that is supposed to multi- 
cast a NEWVIEW message. When a node receives such 
a message, it needs to compute the start state for the next 
view based on the information contained in that message. 
The new start state is calculated by first identifying the 
highest CC present among all VIEWCHANGE messages; 
this determines the new base history digest $h_n$ for the start
sequence number n of the new view. 

But nodes also need to determine how to order the dif- 
ferent OR messages that are present in the NEWVIEW 
message but not yet committed. Contained OR mes- 
sages (potentially including concurrent requests) are or- 
dered using a deterministic function of the requests that 
produces a total order for these requests. Having a fixed 
function allows all nodes receiving the NEWVIEW mes- 
sage to easily agree on the final order for the concurrent 
OR present in that message. Alternatively, we could let 
the primary replica propose an ordering, and disseminate 
it as an additional parameter of the NEWVIEW message. 

Replicas receiving the NEWVIEW message then exe- 
cute the requests in the OR messages according to that 
fixed order, updating their histories and history digests. 
If a replica has already executed some weak operations 
in an order that differs from the new ordering, it first rolls 
back the application state to the state of the last check- 
point (Section 4.8) and executes all operations after the 
checkpoint, starting with committed requests and then 
with the weak requests ordered by the NEWVIEW mes- 
sage. Finally, the replica broadcasts a VIEWCONFIRM 
message. As mentioned, when a replica collects matching
VIEWCONFIRM messages on v, n, and $h_n$, it becomes
active in the new view. 
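
The text does not specify which deterministic function is used to order uncommitted OR messages. As one possible illustration (our assumption, not the authors' choice), the sketch below deduplicates the uncommitted requests and sorts them by their SHA-256 digest, so that every replica receiving the same set computes the same order.

```python
import hashlib
from typing import Iterable, List


def merge_order(uncommitted_requests: Iterable[bytes]) -> List[bytes]:
    """Total order over uncommitted requests that depends only on the set."""
    unique = set(uncommitted_requests)   # a request is executed only once
    return sorted(unique, key=lambda r: hashlib.sha256(r).digest())


# Two replicas that saw the same requests in different orders agree.
order_a = merge_order([b"add item 1", b"remove item 2", b"add item 1"])
order_b = merge_order([b"remove item 2", b"add item 1"])
```

Any function works here as long as its output depends only on the set of requests and not on the order in which a replica happened to receive them.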

Our merge procedure re-executes the concurrent op- 
erations sequentially, without running any additional or 
alternative application-specific conflict resolution proce- 
dure. This makes the merge algorithm slightly simpler, 
but requires the application upcall that executes client op- 
erations to contain enough information to identify and re- 
solve concurrent operations. This is similar to the design 
choice made by Bayou [33] where special concurrency 
detection and merge procedures are part of each service
operation, enabling servers to automatically detect and 
resolve conflicts. 


Limiting the number of merge operations. A faulty 
replica can trigger multiple merges by producing a new 
POD for each conflicting request in the same view, or 
generating PODs for requests in old views where it
or a colluding replica was the primary. To avoid this
potential performance problem, replicas remember the
last POD, POM, or POA that every other replica initiated,
and reject a POM/POD/POA from the same or a lower
view coming from that replica. This ensures that a faulty
replica can initiate a POD/POM/POA only once from
each view it participated in. This, as we show in Section 5,
helps establish our liveness properties.
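
The per-replica bookkeeping described above can be sketched as follows. The class and method names are hypothetical, and proof validation is assumed to happen elsewhere.

```python
from typing import Dict


class ProofFilter:
    """Remembers the view of the last proof each replica initiated."""

    def __init__(self) -> None:
        self.last_view: Dict[int, int] = {}

    def accept(self, sender: int, view: int) -> bool:
        """Accept a POM/POD/POA only if it is for a higher view than before."""
        if sender in self.last_view and view <= self.last_view[sender]:
            return False          # same or lower view: reject
        self.last_view[sender] = view
        return True
```

A replica that submits a second proof for the same view, or for an older one, is rejected, which bounds the number of merges it can trigger.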


Recap comparison to Zyzzyva. Zeno’s view changes
motivate our removal of the single-phase Zyzzyva optimization
for the following reason: suppose a strong
client request REQ was executed (and committed) at sequence
number n at 3f + 1 replicas. Now suppose there
was a weak view change, the new primary is faulty, and
only f + 1 replicas are available. A faulty replica among
those has the option of reporting REQ in a different order
in its VIEWCHANGE message, which enables the
primary to order REQ arbitrarily in its NEWVIEW message;
this is possible because only a single (potentially
faulty) replica need report any request during a Zeno
view change. This means that linearizability is violated
for this strong, committed request REQ. Although it may
be possible to design a more involved view change to
preserve such orderings, we chose to keep things simple
instead. As our results show, in many settings where
eventual consistency is sufficient for weak operations,
our availability under partitions trumps any benefits from
increased throughput due to Zyzzyva’s optimized
single-phase request commitment.


4.8 Garbage Collection 


The protocol we have presented so far has two important 
shortcomings: the protocol state grows unboundedly, and 
weak requests are never committed unless they are fol- 
lowed by a strong request. 

To address these issues, Zeno periodically takes 
checkpoints, garbage collecting its logs of requests and 
forcing weak requests to be committed. 

When a replica receives an ORDERREQ message from
the primary for sequence number M, it checks if M
mod CHKP_INTERVAL = 0. If so, it broadcasts the
COMMIT message corresponding to M to other replicas.
Once a replica receives 2f + 1 COMMIT messages
matching in v, M, and $h_M$, it creates the commit
certificate for sequence number M. It then sends
a $\langle \mathrm{CHECKPOINT}, v, M, h_M, App \rangle_{\sigma_j}$ to all other replicas.
App is a snapshot of the application state after executing
requests up to and including M. When it receives
f + 1 matching CHECKPOINT messages, it considers the
checkpoint stable, stores this proof, and discards all ordered
requests with sequence number lower than n along
with their corresponding client requests.
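
The thresholds in the checkpoint procedure can be written down as three predicates. This is a sketch with hypothetical function names; the interval value used in the usage lines is an arbitrary example, not a value taken from the paper.

```python
def should_start_checkpoint(seq: int, chkp_interval: int) -> bool:
    """A checkpoint starts when the sequence number is a multiple of the interval."""
    return seq % chkp_interval == 0


def has_commit_certificate(matching_commits: int, f: int) -> bool:
    """A commit certificate needs 2f + 1 matching COMMIT messages."""
    return matching_commits >= 2 * f + 1


def checkpoint_is_stable(matching_checkpoints: int, f: int) -> bool:
    """A checkpoint is stable after f + 1 matching CHECKPOINT messages."""
    return matching_checkpoints >= f + 1


# Example with f = 1 and an assumed interval of 128.
start = should_start_checkpoint(256, 128)
certified = has_commit_certificate(3, 1)
stable = checkpoint_is_stable(2, 1)
```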

Also, in case the checkpoint procedure is not run 
within the interval of $T_{CHKP}$ time units, and a replica has
some not yet committed ordered requests, the replica also 
initiates the commit step of the checkpoint procedure. 


This is done to make sure that pending ordered requests 
are committed when the service is rarely used by other 
clients and the sequence numbers grow very slowly. 


Our checkpoint procedure described so far poses a 
challenge to the protocol for detecting concurrent his- 
tories. Once old requests have been garbage-collected, 
there is no way to verify, in the case of a slow replica (or 
a malicious replica pretending to be slow) that presents 
an old request, if that request has been committed at that 
sequence number or if there is divergence. 


To address this, clients send sequential timestamps to 
uniquely identify each one of their own operations, and 
we added a list of per-client timestamps to the checkpoint 
messages, representing the maximum operation each 
client has executed up to the checkpoint. This is in con- 
trast with previous BFT replication protocols, including 
Zyzzyva, where clients identified operations using times- 
tamps obtained by reading their local clocks. Concretely, 
a replica sends $\langle \mathrm{CHECKPOINT}, v, M, h_M, App, CSet \rangle_{\sigma_j}$,
where CSet is a vector of (c,t) tuples, where t is the 
timestamp of the last committed operation from c. 


This allows us to detect concurrent requests, even if 
some of the replicas have garbage-collected that request. 
Suppose a replica i receives an OR with sequence number
n that corresponds to client c’s request with timestamp
$t_1$. Replica i first obtains the timestamp of the
last executed operation of c in the highest checkpoint,
$t_c = CSet[c]$. If $t_1 \leq t_c$, then there is no divergence since
the client request with timestamp $t_1$ has already been
committed. But if $t_1 > t_c$, then we need to check if some
other request was assigned n, providing a proof of divergence.
If n < M, then the CHECKPOINT and the OR form
a POD since some other request was assigned n. Else, we
can perform the regular conflict detection procedure to identify
concurrency (see Section 4.6).
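
The timestamp check against the checkpoint can be sketched as below. The function name, the string labels, and the convention that a client absent from CSet has timestamp 0 are our assumptions; `cset` maps each client to the timestamp of its last committed operation and `checkpoint_seq` is the checkpoint's sequence number M.

```python
from typing import Dict


def check_against_checkpoint(n: int, t_req: int, client: str,
                             cset: Dict[str, int], checkpoint_seq: int) -> str:
    """Classify an OR for sequence number n using the latest checkpoint."""
    t_c = cset.get(client, 0)     # assumed default for unseen clients
    if t_req <= t_c:
        return "already committed"   # no divergence
    if n < checkpoint_seq:
        return "POD"                 # another request was assigned n
    return "regular detection"       # fall back to digest comparison


cset = {"c1": 7}
```

With this checkpoint, an OR carrying c1's timestamp 5 is treated as already committed, whereas timestamp 9 at a sequence number below M yields a POD.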


Note that our checkpoints become stable only when 
there are at least 2f + 1 replicas that are able to agree. In
the presence of partitions or other unreachability situa- 
tions where only weak quorums can talk to each other, it 
may not be possible to gather a checkpoint, which im- 
plies that Zeno must either allow the state concerning 
tentative operations to grow without bounds, or weaken 
its liveness guarantees. In our current protocol we chose 
the latter, and so replicas stop participating once they 
reach a maximum number of tentative operations they 
can execute, which could be determined based on their 
available storage resources (memory as well as the disk 
space). Garbage collecting weak operations and the resulting
impact on conflict detection are left as future
work.


NSDI ’09: 6th USENIX Symposium on Networked Systems Design and Implementation 




5 Correctness 


In this section, we sketch the proof that Zeno satisfies the 
safety properties specified in Section 3. A proof sketch 
for liveness properties is presented in a separate technical 
report [31]. 

In Zeno, a (weak or strong) response is based on identical
histories of at least f + 1 replicas, and, thus, at
least one of these histories belongs to a correct replica.
Hence, in the case that our garbage collection scheme
is not initiated, we can reformulate the safety requirements
as follows: (S1) the local history maintained by
a correct replica consists of a prefix of committed requests
extended with a sequence of speculative requests,
where no request appears twice, (S2) a request associated
with a correct client c appears in a history at a
correct replica only if c has previously issued the request,
(S3) the committed prefixes of histories at
every two correct replicas are related by containment,
and (S4) at any time, the number of conflicting histories
maintained at correct replicas does not exceed
$maxhist = \lfloor (N - f')/(f - f' + 1) \rfloor$, where f' is the number of
currently failed replicas and N is the total number of replicas
required to tolerate a maximum of f faulty replicas. Here
we say that two histories are conflicting if neither of them
is a prefix of the other.

Properties (S1) and (S2) are implied by the state main- 
tenance mechanism of our protocol and the fact that only 
properly signed requests are put in a history by a correct 
replica. The special case when a prefix of a history is 
hidden behind a checkpoint is discussed later. 

A committed prefix of a history maintained at a correct 
replica can only be modified by a commitment of a new 
request or a merge operation. The sub-protocol of Zeno 
responsible for committing requests is analogous to the
two-phase conservative commitment in Zyzzyva [23], 
and, similarly, guarantees that all committed requests are 
totally ordered. When two histories are merged at a cor- 
rect replica, the resulting history adopts the longest com- 
mitted prefix of the two histories. Thus, inductively, the 
committed prefixes of all histories maintained at correct 
replicas are related by containment (S3). 

Now suppose that at a given time, the number of conflicting
histories maintained at correct replicas is more
than maxhist. Our weak quorum mechanism guarantees
that each history maintained at a correct process is
supported by at least f + 1 distinct processes (through
sending SPECREPLY and REPLY messages). A correct
process cannot concurrently acknowledge two conflicting
histories. But when f' replicas are faulty, there can
be at most $\lfloor (N - f')/(f - f' + 1) \rfloor$ sets of f + 1 replicas
that are disjoint in the set of correct ones. Thus, at least
one correct replica acknowledged two conflicting histories,
a contradiction that establishes (S4).
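
As a worked example of the bound in (S4), the sketch below evaluates maxhist for small configurations. The function name and the argument checks are ours; the formula is the one stated in the text, read as a floor.

```python
def maxhist(n_replicas: int, f: int, f_failed: int) -> int:
    """Upper bound on conflicting histories at correct replicas (S4)."""
    if not 0 <= f_failed <= f:
        raise ValueError("expected 0 <= f' <= f")
    return (n_replicas - f_failed) // (f - f_failed + 1)


# N = 3f + 1 = 4 replicas, f = 1.
no_failures = maxhist(4, 1, 0)   # floor(4 / 2)
one_failure = maxhist(4, 1, 1)   # floor(3 / 1)
```

With N = 4 and f = 1 the bound is 2 when no replica has failed and 3 when one replica is faulty, since each history then needs only one additional correct supporter.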

Checkpointing. Note that our garbage collection
scheme may affect property (S1): the sequence of tenta- 
tive operations maintained at a correct replica may poten- 
tially include a committed but already garbage-collected 
operation. This, however, cannot happen: each round of 
garbage collection produces a checkpoint that contains 
the latest committed service state and the logical times- 
tamp of the latest committed operation of every client. 
Since no correct replica agrees to commit a request from 
a client unless its previous requests are already commit- 
ted, the checkpoint implies the set of timestamps of all 
committed requests of each client. If a replica receives an 
ordered request of a client c corresponding to a sequence 
number preceding the checkpoint state, and the times- 
tamp of this request is no later than the last committed 
request of c, then the replica simply ignores the request, 
concluding that the request is already committed. Hence, 
no request can appear in a local history twice. 


6 Evaluation 


We have implemented a prototype of Zeno as an exten- 
sion to the publicly available Zyzzyva source code [24]. 

Our evaluation tries to answer the following questions: 
(1) Does Zeno incur more overhead than existing proto- 
cols in the normal case? (2) Does Zeno provide higher 
availability compared to existing protocols when there 
are more than f unreachable nodes? (3) What is the cost 
of merges? 


Experimental setup. We set f = 1, and the minimum
number of replicas to tolerate it, N = 3f + 1 = 4. We vary
the number of clients to increase load. Each physical ma- 
chine has a dual-core 2.8 GHz AMD processor with 4GB 
of memory, running a 2.6.20 Linux kernel. Each replica 
as well as a client runs on a dedicated physical machine. 
We use Modelnet [35] to simulate a network topology 
consisting of two hubs connected via a bi-directional link 
unless otherwise mentioned. Each hub has two servers in 
all of our experiments but client location varies as per the 
experiment. Each link has one-way latency of 1 ms and 
a 100 Mbps bandwidth. 


Transport protocols. Zyzzyva, like PBFT, uses multi- 
cast to reduce the cost of sending operations from clients 
to all replicas, so it uses UDP as a transport protocol and 
implements a simple backoff and retry policy to handle 
message loss. This is not optimized for periods of congestion
and high message loss, such as those we anticipate
during merges when the replicas that were partitioned
need to bring each other up-to-date. To address
this, Zeno uses TCP as the transport layer during the 
merge procedure but continues to use Zyzzyva’s UDP- 
based transport during normal operation and multicast- 
ing communication that is sent to all replicas. 

Partition. We simulate network partitions by separating
the two hubs from each other. We vary the duration of
the partitions from 1 to 5 minutes, based on the observa- 
tion by Chandra et al. [12] that a large fraction (> 75%) 
of network disconnectivity events range from 30 to 500 
seconds. 


6.1 Implementation 


Replacing PKI with MACs. Our Zeno prototype uses 
MACs instead of the slower digital signatures to imple- 
ment message authentication for the common-case, but 
still uses signatures for view changes. Using MACs in- 
duces some small mechanistic design changes over the 
protocol description in Section 4; these changes are stan- 
dard practice in similar protocols including Zyzzyva, and 
are presented in [31]. 


Merge. Replicas detect divergence by following the al- 
gorithm specified in Section 4.7. We implemented an 
optimization to the merge protocol where replicas first 
move to the higher view and then propagate their local 
uncommitted requests to the primary of the higher view. 
The primary of the higher view orders these requests as if 
they are received from the client and hence merges these 
requests in the history. 


6.2 Results 


We generate a workload with a varying fraction of strong 
and weak operations. If each client issued both strong 
and weak operations, then most clients would block soon 
after network partitions started. Instead, we simulate two
kinds of clients: (i) weak clients only issue weak requests
and (ii) strong clients always issue strong requests. This
allows us to vary the ratio of weak operations (denoted
by α) in the total workload with a limited number of
clients in the system and long network partitions. We 
use a micro-benchmark that executes a no-op when the 
execute upcall for the client operation is invoked. 

We have also built a simple application on top of Zeno, 
emulating a shopping cart service with operations to add, 
remove, and checkout items based on a key-value data 
store. We also implement a simple conflict detection and 
merge procedure. Due to lack of space, the design and 
evaluation of this service is presented in the technical re- 
port [31]. 


Protocol               | Batch=1   | Batch=10
Zyzzyva (single phase) | 62 Kops/s | 88 Kops/s
Zeno (weak)            | 60 Kops/s | 86 Kops/s
Zeno (strong)          | 40 Kops/s | 82 Kops/s
Zyzzyva (commit opt)   | 40 Kops/s | 82 Kops/s


Table 2: Peak throughput of Zeno and Zyzzyva. 





6.2.1 Maximum throughput in the normal case 


We compare the normal case performance of Zeno with 
Zyzzyva. In both systems we used the optimization of 
batching requests to reduce protocol overhead. In this 
experiment, the clients and servers are connected by a 
1 Gbps switch with 0.1 ms round trip latency. We ex- 
pect the peak throughput of Zeno with weak operations 
to approximately match the peak throughput of Zyzzyva 
since both can be completed in a single phase. However, 
the performance of Zeno with strong operations will be 
lower than the peak throughput of Zyzzyva since Zeno 
requires an extra phase to commit a strong operation. 

Our results, presented in Table 2, show that the throughput
of Zeno and Zyzzyva is similar, with Zyzzyva
achieving slightly (3-6%) higher throughput than Zeno
for weak operations. The results also show
that, with batching, Zeno’s throughput for strong operations
is also close to Zyzzyva’s peak throughput:
Zyzzyva has 7% higher throughput when the single
phase optimization is employed. However, when a single
replica is faulty or slow, Zyzzyva cannot achieve the sin- 
gle phase throughput and Zeno’s throughput for strong 
operations is identical to Zyzzyva’s performance with a 
faulty replica. 


6.2.2 Partition with no concurrency 


For all the remaining experiments, we use the ModelNet
setup and disable multicast since ModelNet does not support
it. We use a client population of 4 nodes, each sending
a new request of minimal payload (2 Bytes) as soon
as it has completed the previous request. This generates 
a steady load of approximately 500 requests/sec on the 
system. This is similar to an example SLA provided in 
Dynamo [15]. We use a batch size of 1 for both Zyzzyva 
and Zeno, since it is sufficient to handle the incoming 
request load. 

In this experiment, all clients reside in the first LAN. 
We initiate a partition at 90 seconds which continues for 
a minute. Since there are no clients in the second LAN, 
there are no requests processed in it and hence there is no 
concurrency, which avoids the cost of merging. Replicas 
with ids 0 (the primary for the initial view 0) and 1 reside
in the first LAN, while replicas with ids 2 and 3 reside in
the second LAN. We also present the results of Zyzzyva
to compare the performance both in the normal case and
under the given failure.


Varying α. We vary the mix of weak and strong operations
in the workload, and present the results in Figure 1.
First, strong operations block as soon as the failure starts 
which is expected since not enough replicas are reach- 
able from the first LAN to complete the strong opera- 
tion. However, as soon as the partition heals, we observe 
that strong operations start to be completed. Note also
that Zyzzyva also blocks as soon as the failure starts and
resumes as soon as it ends.

Figure 1: Two replicas are disconnected via a partition
that starts at time 90 and continues for 60 seconds. Parameter
α represents the fraction of weak operations in
the workload. Note that the throughput of weak and
strong operations in Zeno is presented separately for clarity.

Second, weak operations continue to be processed and
completed during the partition, because Zeno
requires (for f = 1) only 2 non-faulty replicas to complete
the operation. The fraction of total requests completed
increases as α increases, essentially improving the
availability of such operations despite network partitions.

Third, when replicas in the other LAN are reachable 
again, they need to obtain the missing requests from the 
first LAN. Since the number of weak operations performed
in the first LAN increases as α increases, the time
to update the lagging replicas in the other partition also 
goes up; this puts a temporary strain on the network, ev- 
idenced by the dip in the throughput of weak operations 
when the partition heals. However, this dip is brief com- 
pared to the duration of the partition. We explore the 
impact of the duration of partitions next. 
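
As an illustration, the blocking behavior in the first two observations can be sketched as a quorum check. This is a minimal sketch, not Zeno's implementation. It assumes N = 3f + 1 replicas and a weak quorum of f + 1, which is consistent with the 2 replicas stated above for f = 1. The strong quorum of 2f + 1 is an assumption taken from standard BFT quorum sizes, and the function name is hypothetical.

```python
def can_complete(reachable: int, f: int, strong: bool) -> bool:
    """Return True if an operation can gather its quorum.

    Assumed quorum sizes (not stated in this section):
    weak = f + 1, strong = 2f + 1, out of N = 3f + 1 replicas.
    `reachable` counts non-faulty replicas the client can contact.
    """
    n = 3 * f + 1
    if not 0 <= reachable <= n:
        raise ValueError("reachable must be between 0 and N")
    quorum = 2 * f + 1 if strong else f + 1
    return reachable >= quorum


# Setup of this experiment: f = 1, four replicas, two per LAN.
weak_in_partition = can_complete(reachable=2, f=1, strong=False)
strong_in_partition = can_complete(reachable=2, f=1, strong=True)
strong_after_heal = can_complete(reachable=4, f=1, strong=True)
```

Under these assumptions, the two replicas in the first LAN suffice for weak operations but not for strong ones, which matches the behavior observed during the partition.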


Varying partition duration. Using the same setup, we
now vary partition durations between 1 and 5 minutes
for α = 75%. For each partition duration, we measure
the period of unavailability for both weak and strong operations.
The unavailability is measured as the number
of seconds for which the observed throughput, on either
side of the partition, was less than 10% of the average
throughput observed before the partition started. Also,
the distance from the “Strong” line to the baseline (x = y)
indicates how soon after the partition heals strong
operations can be processed again.

Figure 2: Varying partition durations with no concurrent
operations. Baseline represents the minimal unavailability
expected for strong operations, which is equal to the
partition duration.


Figure 2 presents the results. We observe that weak 
operations are always available in this experiment since 
all weak operations were completed in the first LAN and 
the replicas in the first LAN are up-to-date with each 
other to process the next weak operation. Strong operations
are unavailable for the entire duration of the partition
because the replicas in the second LAN are unreachable,
and Zeno introduces some additional unavailability
due to the operation transfer mechanism. However,
the additional delay is within 4% of the partition duration 
(12 seconds for a 5 minute partition). Our current proto- 
type is not yet optimized and we believe that the delay 
could be further reduced. 
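
The unavailability metric used in this experiment can be sketched as follows. This is a minimal sketch under stated assumptions: the per-second throughput trace, the function name, and the toy data are hypothetical, and the sketch handles a single trace rather than both sides of the partition.

```python
from typing import Sequence


def unavailability_seconds(
    throughput: Sequence[float],
    partition_start: int,
    threshold: float = 0.10,
) -> int:
    """Count seconds whose throughput is below `threshold` times the
    average throughput observed before the partition started.

    throughput[i] is the number of operations completed in second i
    (a hypothetical trace format; the paper does not specify one).
    """
    if partition_start <= 0 or partition_start > len(throughput):
        raise ValueError("partition_start must lie inside the trace")
    baseline = sum(throughput[:partition_start]) / partition_start
    cutoff = threshold * baseline
    return sum(1 for ops in throughput[partition_start:] if ops < cutoff)


# Toy trace: 5 s of steady load, a 3 s outage, 1 s of slow recovery
# that stays below 10% of the baseline, then normal throughput.
trace = [500, 500, 500, 500, 500, 0, 0, 0, 20, 480, 500]
outage = unavailability_seconds(trace, partition_start=5)
```

In the toy trace the outage lasts 3 seconds but the metric reports 4, because the first second of recovery is still below the 10% cutoff. The same effect explains the additional delay reported above: 12 seconds on top of a 300-second partition is 4%.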


Varying request size. In this experiment, we simulate 
a partition for 60 seconds but increase the payload sizes 
from 2 Bytes to 1 KB, with an equally sized reply. The 
cumulative bandwidth of requests to be transferred from 
one LAN to the other is a function of the weak request 
offered load, the size of the requests, and the duration of 
the partition. With 60 seconds of partition and an offered 
load of 500 req/s, the cumulative request payload ranges 
from approximately 60 KB to 30 MB for 2 Bytes and 
1 KB request size respectively. The results we obtained 
are very similar to those in Figure 1 so we do not repeat 
them. These show that the time to bring replicas in the 
second LAN up-to-date does not increase significantly 
with the increase in request size. Given that we have 100 
Mbps links connecting replicas to each other, bandwidth 
is not a limiting resource for shipping operations at these 
offered loads. 
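
The arithmetic behind these figures can be checked with a short calculation. This is a sketch under stated assumptions: 1 KB is taken as 1000 bytes, the full 500 req/s is treated as weak load, and the transfer time ignores protocol overhead; the function names are hypothetical.

```python
def cumulative_payload_bytes(rate_rps: float, size_bytes: int, duration_s: float) -> float:
    """Total request payload accumulated during a partition."""
    return rate_rps * size_bytes * duration_s


def transfer_seconds(payload_bytes: float, link_mbps: float) -> float:
    """Idealized time to ship the payload over one link, with no overhead."""
    return payload_bytes * 8 / (link_mbps * 1_000_000)


small = cumulative_payload_bytes(500, 2, 60)     # 2-byte requests
large = cumulative_payload_bytes(500, 1000, 60)  # 1 KB taken as 1000 bytes
large_transfer = transfer_seconds(large, 100)    # 100 Mbps link
```

Here `small` is 60,000 bytes (60 KB) and `large` is 30,000,000 bytes (30 MB). Even the larger payload needs only about 2.4 seconds of raw link time at 100 Mbps, which is consistent with the observation that bandwidth is not the limiting resource.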

Figure 3: Network partition for 60 seconds starting at
time 90 seconds. Note that the throughput of weak and
strong operations in Zeno is presented separately for clarity.


6.2.3 Partition with concurrency


In this experiment, we keep half the clients on each side 
of a partition. This ensures that both partitions observe 
a steady load of weak operations that will cause Zeno 
to first perform a weak view change and later merge the 
concurrent weak operations completed in each partition. 
Hence, this microbenchmark additionally evaluates the 
cost of weak view changes and the merge procedure. As 
before, the primary for the initial view resides in the first 
LAN. We measure the overall throughput of weak and 
strong operations completed in both partitions. Again, 
we compare our results to Zyzzyva. 


Varying α. Figure 3 presents the results for the
throughput of different systems while varying the value
of α. We observe three main points.

When α = 0, Zeno does not give additional benefits
since there are no weak operations to be completed.
Also, as soon as the partition starts, strong operations are 
blocked and resume after the partition heals. As above, 
Zyzzyva provides greater throughput thanks to its single- 
phase execution of client requests, but it is as powerless 
to make progress during partitions as Zeno in the face of 
strong operations only. 

When α = 25%, we have only one client sending weak
operations in one LAN. Since there are no conflicts, this
graph matches that of Figure 1.

When α ≥ 50%, we have at least two weak clients, at
least one in each LAN. When a partition starts, we ob- 
serve that the throughput of weak operations first drops; 
this happens because weak clients in the second parti- 
tion cannot complete operations as they are partitioned 
from the current primary. Once they perform the neces- 
sary view changes in the second LAN, they resume pro- 
cessing weak operations; this is observed by an increase 
in the overall throughput of weak operations completed 
since both partitions can now complete weak operations
in parallel; in fact, they do so faster than before the partition,
due to decreased cryptographic and message overheads and the
reduced round-trip delay between clients in the second partition
and the primary in their partition. The duration
of the weak operation unavailability in the non-primary 
partition is proportional to the number of view changes 
required. In our experiment, since replicas with ids 2 
and 3 reside in the second LAN, two view changes were 
required (to make replica 2 the new primary). 
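
The count of two view changes can be reproduced with a small sketch. It assumes round-robin primaries, i.e., the primary of view v is replica v mod N. This is an assumption, although it is consistent with replica 0 being the primary of view 0 and replica 2 becoming the primary here. The sketch only counts views and does not model the agreement needed to carry out each view change; the function name is hypothetical.

```python
from typing import Iterable


def view_changes_needed(current_view: int, reachable: Iterable[int], n: int) -> int:
    """Number of consecutive view changes until the primary is reachable.

    Assumes round-robin primaries: the primary of view v is replica v % n.
    """
    ids = set(reachable)
    if not ids or any(r < 0 or r >= n for r in ids):
        raise ValueError("reachable must be a non-empty set of ids in [0, n)")
    changes = 0
    # Advance one view at a time until its primary is in the partition.
    while (current_view + changes) % n not in ids:
        changes += 1
    return changes


# Second LAN in this experiment: replicas 2 and 3, starting from view 0.
changes = view_changes_needed(current_view=0, reachable={2, 3}, n=4)
```

With replicas 2 and 3 in the second LAN, views 0 and 1 have unreachable primaries, so `changes` is 2 and replica 2 becomes the primary.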

When the partition heals, replicas in the first view de- 
tect the existence of concurrency and construct a POD, 
since replicas in the second LAN are in a higher view 
(with v = 2). At this point, they request a NEWVIEW
from the primary of view 2, move to view 2, and then
propagate their locally executed weak operations to the
primary of view 2. Next, replicas in the first LAN need
to fetch the weak operations that completed in the second
LAN and complete them before the strong
operations can make progress. This results in additional 
delay before the strong operations can complete, as ob- 
served in the figure. 


Varying partition duration. Next, we simulate partitions
of varying duration as before, for α = 75%. Again,
we measure the unavailability of both strong and weak 
operations using the earlier definition: unavailability is 
the duration for which the throughput in either parti- 
tion was less than 10% of average throughput before 
the failure. With a longer partition duration, the cost of 
the merge procedure increases since the weak operations 
from both partitions have to be transferred prior to com- 
pleting the new client operations. 

Figure 4 presents the results. We observe that weak
operations experience some unavailability in this scenario,
whose duration increases with the length of the
partition. The unavailability for weak operations is 
within 9% of the total time of the partition. 

The unavailability of strong operations is at least the 
duration of the network partition plus the merge cost 
(similar to that for weak operations). The additional un- 
availability due to the merge operation is within 14% of 
the total time of the partition. 

Figure 4: Varying partition durations with concurrent
operations. Baseline represents the minimal unavailabil- 
ity expected for strong operations, which is equal to the 
partition duration. 


Varying execution cost and request load. In this ex- 
periment, we vary the execution cost of each operation as 
well as increase the request load, by increasing the num- 
ber of clients, to estimate the cost of merges when the 
system is loaded. For example, the system was operating
at peak CPU utilization with 20 clients and operations
costing 200 μs/operation or more. Here, we set α = 100%.
We present results with a partition duration of 60 seconds
in Figure 5. We observe that as the cost of operations and
the system load increase, the unavailability of weak operations
also goes up. This is expected because the set of
weak operations performed in one partition must be re-executed
at the replicas in the other partition during the
merge procedure. As the client load and the cost of operation
execution increase, the time taken to re-execute
the operations also increases. In particular, when the system
is operating at 100% CPU utilization, re-executing
the operations takes as much time as the
duration of the partition, and therefore the unavailability
in these cases is higher than the partition duration. If,
however, the system is not operating at peak utilization, 
the cost of merging is lower than the partition duration. 


Varying request size. We ran an experiment with a 5 
minute partition, and varying request sizes from 2 Bytes 
to 1 KB. The results with different request sizes were 
similar to those shown in Figure 3 so we do not plot them. 
We observed that increasing the payload size does not 
significantly affect the merge duration. This is due to the 
high speed network connection between replicas. 


Summary. Our microbenchmark results show that 
Zeno significantly improves the availability of weak op- 
erations and the cost of merging is reasonable as long 
as the system is not overloaded. This allows Zeno to
resume processing strong operations soon after partitions
heal.


Figure 5: Varying execution cost of operations with increasing
request load. 60 second partition duration.


6.2.4 Mix of strong and weak operations


In this experiment, we allow each client to issue a mix of
strong and weak operations. Note that as soon as a client
issues a strong operation in a partition, it will be blocked
until the partition heals. We use a client population of 40
nodes. Each client issues a strong operation with probability
p, a weak operation with probability 0.8 − p, and
exits from the system with a fixed probability of 0.2.
We implement a fixed think time of 10 seconds between
operations issued by each client. The think times and
the exit probability are obtained from the SpecWeb2005 
banking benchmark [10]. Next, we vary p to estimate 
the impact of failure events such as network partitions on 
the overall user experience. To give an idea of reference 
values for p, we looked into the types and frequencies 
of distinct operations in existing benchmarks. In an e-banking
benchmark, if the billing operations are treated
as strong operations, the recommended frequency of
such operations gives p = 0.13 [10]. In the case of
an e-commerce benchmark, if the checkout operation is
considered strong while the remaining operations, such as login,
accessing account information, and customizations, are considered
weak, then we obtain p = 0.05 [1].
Our experimental results cover these values. 


We simulate a partition duration of 60 seconds and cal- 
culate the number of clients blocked and the length of 
time they were blocked during the partition. Figure 6 
presents the cumulative distribution function of clients 
on the y-axis and the maximum duration a client was 
blocked on the x-axis. This metric allows us to see how 
clients were affected by the partition. With Zyzzyva, all 
clients will be blocked for the entire duration of the par- 
tition. However, with Zeno, a large fraction of clients 
do not observe any wait time and this is because they 
exit from the system after doing a few weak operations. 
For example, more than 70% of clients do not observe 
any wait time as long as the probability of performing a 
strong operation is less than 15%. In summary, this result 
shows that Zeno significantly improves the user experi- 
ence and masks the failure events from being exposed 
to the user as long as the workload contains few strong 
operations. 
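
The metric plotted in Figure 6 is an empirical cumulative distribution over per-client maximum blocked durations. A minimal sketch of that computation follows; the input data and the function name are hypothetical and do not reproduce the measured results.

```python
from typing import List, Sequence, Tuple


def blocked_time_cdf(max_blocked: Sequence[float]) -> List[Tuple[float, float]]:
    """Empirical CDF: for each distinct wait time t, the fraction of
    clients whose maximum blocked duration is at most t."""
    if not max_blocked:
        raise ValueError("need at least one client")
    ordered = sorted(max_blocked)
    total = len(ordered)
    points = []
    for i, t in enumerate(ordered):
        # Emit one point at the last occurrence of each distinct value.
        if i + 1 == total or ordered[i + 1] != t:
            points.append((t, (i + 1) / total))
    return points


# Hypothetical data: 7 of 10 clients never wait, 3 are blocked.
waits = [0, 0, 0, 0, 0, 0, 0, 25, 40, 60]
cdf = blocked_time_cdf(waits)
```

The first point of the result gives the fraction of clients that observe no wait time at all, which is the quantity discussed above.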

Figure 6: Wait time per client with varying probability
p of issuing strong operations. 


7 Related Work 


The trade-off between consistency, availability and tolerance
to network partitions in computing services
became folklore long ago [7].

Most replicated systems are designed to be “strongly” 
consistent, i.e., provide clients with consistency guarantees
that approximate the semantics of a single, correct
server, such as single-copy serializability [20] or lineariz- 
ability [22]. 

Weaker consistency criteria, which allow for better 
availability and performance at the expense of letting 
replicas temporarily diverge and users see inconsistent 
data, were later proposed in the context of replicated services
tolerating crash faults [17, 30, 33, 38]. We improve
on this body of work by considering the more challeng- 
ing Byzantine-failure model, where, for instance, it may 
not suffice to apply an update at a single replica, since 
that replica may be malicious and fail to propagate it. 

There are many examples of Byzantine-fault tolerant 
state machine replication protocols, but the vast majority
of them were designed to provide linearizable semantics
[4, 8, 11, 23]. Similarly, Byzantine-quorum protocols
provide other forms of strong consistency, such as safe, 
regular, or atomic register semantics [27]. We differ from 
this work by analyzing a new point in the consistency- 
availability tradeoff, where we favor high availability and 
performance over strong consistency. 

There are very few examples of Byzantine-fault toler- 
ant systems that provide weak consistency. 

SUNDR [25] and BFT2F [26] provide similar forms 
of weak consistency (fork and fork*, respectively) in 
a client-server system that tolerates Byzantine servers. 
While SUNDR is designed for an unreplicated service 
and is meant to minimize the trust placed on that server, 
BFT2F is a replicated service that tolerates a subset of 
Byzantine-faulty servers. A system with fork consis- 
tency might conceal users’ actions from each other, but if 
it does, users get divided into groups and the members of 
one group can no longer see any of another group’s file 
system operations. 


These two systems propose quite different consistency 
guarantees from the guarantees provided by Zeno, be- 
cause the weaker semantics in SUNDR and BFT2F have 
very different purposes than our own. Whereas we are 
trying to achieve high availability and good performance 
with up to f Byzantine faults, the goal in SUNDR and 
BFT2F is to provide the best possible semantics in the 
presence of a large fraction of malicious servers. In the 
case of SUNDR, this means the single server can be ma- 
licious, and in the case of BFT2F this means tolerating 
arbitrary failures of up to 2/3 of the servers. Thus they
associate client signatures with updates such that, when 
such failures occur, all the malicious servers can do is 
conceal client updates from other clients. This makes the 
approach of these systems orthogonal and complemen- 
tary to our own. 

Another example of a system that provides weak con- 
sistency in the presence of some Byzantine failures can 
be found in [32]. However, the system aims at achieving 
extreme availability but provides almost no guarantees 
and relies on a trusted node for auditing. 

To our knowledge, this paper is the first to consider 
eventually-consistent Byzantine-fault tolerant generic 
replicated services. 


8 Future Work and Conclusions 


In this paper we presented Zeno, a BFT protocol that 
privileges availability and performance, at the expense 
of providing weaker semantics than traditional BFT pro- 
tocols. Yet Zeno provides eventual consistency, which 
is adequate for many of today’s replicated services, e.g.,
those that serve as back-ends for e-commerce websites. Our
evaluation of an implementation of Zeno shows it pro- 
vides better availability than existing BFT protocols, 
and that overheads are low, even during partitions and 
merges. 

Zeno is only a first step towards liberating highly avail- 
able but Byzantine-fault tolerant systems from the expen- 
sive burden of linearizability. Our eventual consistency 
may still be too strong for many real applications. For 
example, the shopping cart application does not neces- 
sarily care in what order cart insertions occur, now or 
eventually; this is probably the case for all operations 
that are associative and commutative, as well as oper- 
ations whose effects on system state can easily be rec- 
onciled using snapshots (as opposed to merging or to- 
tally ordering request histories). Defining required con- 
sistency per operation type and allowing the replication 
protocol to relax its overheads for the more “best-effort” 
kinds of requests could provide significant further bene- 
fits in designing high-performance systems that tolerate 
Byzantine faults. 


Abstract

This paper presents SPLAY, an integrated system that 
facilitates the design, deployment and testing of large- 
scale distributed applications. Unlike existing systems, 
SPLAY covers all aspects of the development and evalua- 
tion chain. It allows developers to express algorithms in 
a concise, simple language that highly resembles pseudo- 
code found in research papers. The execution environ- 
ment has low overheads and footprint, and provides a 
comprehensive set of libraries for common distributed 
systems operations. SPLAY applications are run by a 
set of daemons distributed on one or several testbeds. 
They execute in a sandboxed environment that shields the 
host system and enables SPLAY to also be used on non- 
dedicated platforms, in addition to classical testbeds like 
PlanetLab or ModelNet. A controller manages applica- 
tions, offering multi-criterion resource selection, deploy- 
ment control, and churn management by reproducing the 
system’s dynamics from traces or synthetic descriptions. 
SPLAY’s features, usefulness, performance and scalabil- 
ity are evaluated using deployment of representative ex- 
periments on PlanetLab and ModelNet clusters. 


1 Introduction 

Developing large-scale distributed applications is a 
highly complex, time-consuming and error-prone task. 
One of the main difficulties stems from the lack of ap- 
propriate tool sets for quickly prototyping, deploying and 
evaluating algorithms in real settings, when facing unpre- 
dictable communication and failure patterns. Nonethe- 
less, evaluation of distributed systems over real testbeds 
is highly desirable, as it is quite common to discover dis- 
crepancies between the expected behavior of an applica- 
tion as modeled or simulated and its actual behavior when 
deployed in a live network. 

While there exist a number of experimental testbeds 
to address this demand (e.g., PlanetLab [11], Model- 
Net [35], or Emulab [38]), they are unfortunately not used 
as systematically as they should be. Indeed, our first-hand
experience has convinced us that it is far from straight- 
forward to develop, deploy, execute and monitor appli- 
cations for them and the learning curve is usually slow. 
Technical difficulties are even higher when one wants to 
deploy an application on several testbeds, as deployment 
scripts written for one testbed may not be directly usable
for another, e.g., between PlanetLab and ModelNet. As a 
side effect of these difficulties, the performance of an ap- 
plication can be greatly impacted by the technical quality 
of its implementation and the skills of the person who 
deploys it, overshadowing features of the underlying al- 
gorithms and making comparisons potentially unsound 
or irrelevant. More dramatically, the complexity of us- 
ing existing testbeds discourages researchers, teachers, or 
more generally systems practitioners from fully exploit- 
ing these technologies. 

These various factors outline the need for novel 
development-deployment systems that would straightfor- 
wardly exploit existing testbeds and bridge the gap be- 
tween algorithmic specifications and live systems. For 
researchers, such a system would significantly shorten 
the delay experienced when moving from simulation to 
evaluation of large-scale distributed systems (“time-to- 
paper” gap). Teachers would use it to focus their lab work 
on the core of distributed programming—algorithms and 
protocols—and let students experience distributed sys- 
tems implementation in real settings with little effort. 
Practitioners could easily validate their applications in the 
most adverse conditions. 

There already exist several systems to ease the de- 
velopment or deployment process of distributed applica- 
tions. Tools like Mace [23] or P2 [26] assist the developer 
by generating code from a high-level description, but do 
not provide any facility for its deployment or evaluation. 
Tools such as Plush [9] or Weevil [37] help for the de- 
ployment process, but are restricted to situations where 
the user has control over the nodes composing the testbed 
(i.e., the ability to run programs remotely using ssh or
similar). 

To address these limitations, we propose SPLAY, an in- 
frastructure that simplifies the prototyping, development, 
deployment and evaluation of large-scale systems. Un- 
like existing tools, SPLAY covers the whole chain of dis- 
tributed systems design and evaluation. It allows develop- 
ers to specify distributed applications in a concise manner 
using a platform-independent, lightweight and efficient 
language based on Lua [20]. For instance, a complete 
implementation of the Chord [33] distributed hash table 
(DHT) requires approximately 100 lines of code. 

SPLAY provides a secure and safe environment for ex- 
ecuting and monitoring applications, and allows for a 
simplified and unified usage of testbeds such as PlanetLab,
ModelNet, networks of idle workstations, or personal
computers. SPLAY applications execute in a safe,
sandboxed environment with controlled access to local 
resources (file system, network, memory) and can be in- 
stantiated on a large set of nodes with a single com- 
mand. SPLAY supports multi-user resource reservation 
and selection, orchestrates the deployment and monitors 
the whole system. It is particularly easy with SPLAY to 
reproduce a given live experiment or to control several 
experiments at the same time. 

An important component of SPLAY is its churn man- 
ager, which can reproduce the dynamics of a distributed 
system based on real traces or synthetic descriptions. 
This aspect is of paramount importance, as natural churn 
present in some testbeds such as PlanetLab is not reproducible,
hence preventing a fair comparison of protocols
under the very same conditions. 

SPLAY is designed for a broad range of usages, including:
(i) deploying distributed systems whose lifetime is
specified at runtime and usually short, e.g., distributing
a large file using BitTorrent [17]; (ii) executing long-running
applications, such as an indexing service based
on a DHT or a cooperative web cache, for which the
population of nodes may dynamically evolve during the
lifetime of the system (and where failed nodes must be
replaced automatically); or (iii) experimenting with distributed
algorithms, e.g., in the context of a hands-on networking
class, by leveraging the isolation properties of
SPLAY to enable execution of (possibly buggy) code on a 
shared testbed without interference. 


Contributions. This paper introduces a distributed in- 
frastructure that greatly simplifies the prototyping, devel- 
opment, deployment, and execution of large-scale dis- 
tributed systems and applications. SPLAY includes sev- 
eral original features—notably churn management, sup- 
port for mixed deployments, and platform-independent 
language and libraries—that make the evaluation and 
comparison of distributed systems much easier and fairer 
than with existing tools. 

We show how SPLAY applications can be concisely 
expressed with a specialized language that closely re- 
sembles the pseudo-code usually found in research pa- 
pers. We have implemented several well-known systems: 
Chord [33], Pastry [31], Scribe [15], SplitStream [14], 
BitTorrent [17], Cyclon [36], Erdős-Rényi epidemic
broadcast [19] and various types of distribution trees [13]. 

Our system has been thoroughly evaluated along all its 
aspects: conciseness and ease of development, efficiency, 
scalability, stability and features. Experiments convey 
SPLAY’s good properties and the ability of the system to 
help practitioner and researcher alike through the whole 
distributed system design, implementation and evaluation 
chain. 

Roadmap. The remainder of this paper is organized
as follows. We first discuss related work in Section 2. 
Section 3 gives an overview of the SPLAY architecture 
and elaborates on its design choices and rationales. In 
Section 4, we illustrate the development process of a
complete application (the Chord DHT [33]). Section 5 
presents a complete evaluation of SPLAY, using repre- 
sentative experiments and deployments (including tests 
of the Chord implementation of Section 4). Finally, we 
conclude in Section 6. 


2 Related Work 


SPLAY shares similarities with a large body of work in 
the area of concurrent and distributed systems. We only 
present systems that are closely related to our approach. 


Development tools. On the one hand, a set of new 
languages and libraries have been proposed to ease and 
speed up the development process of distributed applica- 
tions. 

Mace [23] is a toolkit that provides a wide set of tools 
and libraries to develop distributed applications using an 
event-driven approach. Mace defines a grammar to spec- 
ify finite state machines, which are then compiled to C++ 
code, implementing the event loop, timers, state tran- 
sitions, and message handling. The generated code is 
platform-dependent: this can prove to be a constraint in 
heterogeneous environments. Mace focuses on applica- 
tion development and provides good performance results 
but it does not provide any built-in facility for deploying 
or observing the generated distributed application. 

P2 [26] uses a declarative logic language named Over- 
Log to express overlays in a compact form by specifying 
data flows between nodes, using logical rules. While the 
resulting overlay descriptions are very succinct, specifi- 
cations in P2 are not natural to most network program- 
mers (programs are largely composed of table declara- 
tion statements and rules) and produce applications that 
are not very efficient. Similarly to Mace, P2 does not 
provide any support for deploying or monitoring applica- 
tions: the user has to write his/her own scripts and tools. 

Other domain-specific languages have been proposed 
for distributed systems development. In RTAG [10], pro- 
tocols are specified as a context-free grammar. Incoming 
messages trigger reduction of the rules, which express 
the sequence of events allowed by the protocol. Mor- 
pheus [8] and Prolac [24] target network protocols devel- 
opment. All these systems share the goal of SPLAY to 
provide easily readable yet efficient implementations, but 
are restricted to developing low-level network protocols, 
while SPLAY targets a broader range of distributed sys- 
tems. 


Deployment tools. On the other hand, several tools 
have been proposed to provide runtime facilities for dis- 
tributed applications developers by easing the deploy- 
ment and monitoring phase. 

Neko [34] is a set of libraries that abstract the net- 
work substrate for Java programs. A program that uses 
Neko can be executed without modifications either in 
simulations or in a real network, similarly to the NEST 
testbed [18]. Neko addresses simple deployment issues, 
by using daemons on distant nodes to launch the virtual
machines (JVMs). Nonetheless, Neko’s network library 
has been designed for simplicity rather than efficiency (as 
a result of using Java’s RMI), provides no isolation of de- 
ployed programs, and does not have built-in support for 
monitoring. This restricts its usage to controlled settings 
and small-scale experiments. 

Plush [9] is a set of tools for automatic deployment and 
monitoring of applications on large-scale testbeds such as 
PlanetLab [11]. Applications can be remotely compiled 
from source code on the target nodes. Similarly to Neko 
and SPLAY, Plush uses a set of application controllers 
(daemons) that run on each node of the system, and a 
centralized controller is responsible for managing the ex- 
ecution of the distributed application. 

Along the same lines, Weevil [37] automates the cre- 
ation of deployment scripts. A set of models is provided 
by the user to describe the experiment. An interesting 
feature of Weevil lies in its ability to replay a distributed 
workload (such as a set of requests for a distributed
middleware infrastructure). These inputs can either be
synthetically generated, or recorded from a previous run or
simulation. The deployment phase does not include any 
node selection mechanism: the set of nodes and the map- 
ping of application instances to these nodes must be pro- 
vided by the user. The created scripts allow deployment 
and removal of the application, as well as the retrieval of 
outputs at the end of an experiment. 

Plush and Weevil share a set of limitations that make 
them unsuitable for our goals. First, and most impor- 
tantly, these systems propose high-end features for expe- 
rienced users on experimental platforms such as Planet- 
Lab, but cannot provide resource isolation due to their 
script-based nature. This restricts their usage to
controlled testbeds, i.e., platforms on which the user has
been granted some access rights, as opposed to
non-dedicated environments such as networks of idle
workstations, where it might not be desirable or possible to
create accounts on the machines, and where the nature of
the testbed requires restricting the usage of its resources
(e.g., disk or network usage). Second, they do not provide
any management of the dynamics (churn) of the system,
despite its recognized usefulness [29] for distributed
system evaluation.


Testbeds. A set of experimental platforms, hereafter 
denoted as testbeds, have been built and proposed to the 
community. These testbeds are complementary to the 
languages and deployment systems presented in the first 
part of this section: they are the medium on which these 
tools operate. 

Distributed simulation platforms such as WiDS [25] al- 
low developers to run their application on top of an event- 
based network simulation layer. Distributed simulation 
is known to scale poorly, due to the high load of syn- 
chronization between nodes of the testbed hosting com- 
municating processes. WiDS alleviates this limitation by 


relaxing the synchronization model between processes 
on distinct nodes. Nonetheless, event-based simulation 
testbeds such as WiDS do not provide mechanisms to de- 
ploy or manage the distributed application under test. 

Network emulators such as Emulab [38], Model- 
Net [35], FlexLab [30] or P2PLab [28] can reproduce 
some of the characteristics of a networked environment: 
delays, bandwidth, packet drops, etc. They basically al- 
low users to evaluate unmodified applications across vari- 
ous network models. Applications are typically deployed 
in a local-area cluster and all communications are routed 
through some proxy node(s), which emulate the topology. 
Each machine in the cluster can host several end-nodes 
from the emulated topology. 

The PlanetLab [11] testbed (and forks such as Ever- 
lab [22]) allows experimenting in live networks by host- 
ing applications on a large set of geographically dispersed 
hosts. It is a very valuable infrastructure for testing dis- 
tributed applications in the most adverse conditions. 

SPLAY is designed to complement these systems. 
Testbeds are useful but, often, complex platforms. They 
require the user to know how to deploy applications, to 
have a good understanding of the target topology, and to 
be able to properly configure the environment for exe- 
cuting his/her application (for instance, one needs to use 
a specific library to override the IP address used by the 
application in a ModelNet cluster). In PlanetLab, it is 
time-consuming and error-prone to choose a set of non- 
overloaded nodes on which to test the application, to de- 
ploy and launch the program, and to retrieve the results. 
Finally, mixed deployments that use several
testbeds at the same time for a single experiment would
require writing even more complex scripts (e.g.,
taking into account problems such as port range
forwarding). With SPLAY, as soon as the administrator who
deployed the infrastructure has set up the network, using a
complex testbed is as straightforward for the user as run- 
ning an application on a local machine. 


3 The SPLAY Framework 


We present the architecture of our system: its main com- 
ponents, its programming language, libraries and tools. 


3.1 Architecture 


The SPLAY framework consists of about 15,000 lines of 
code written in C, Lua, Ruby, and SQL, plus some third- 
party support libraries. Roughly speaking, the architec- 
ture is made of three major components. These compo- 
nents are depicted in Figure 1. 

• The controller, splayctl, is a trusted entity that
controls the deployment and execution of applications.

• A lightweight daemon process, splayd, runs on every
machine of the testbed. A splayd instantiates, stops,
and monitors SPLAY applications when instructed by the
controller.

• SPLAY applications execute in sandboxed processes
forked by splayd daemons on participating hosts.

Figure 1: An illustration of two SPLAY applications (BitTorrent
and Chord) at runtime.

Many SPLAY applications can run simultaneously on
the same host. The testbed can be used transparently by 
multiple users deploying different applications on over- 
lapping sets of nodes, unless the controller has been con- 
figured for a single-user testbed. Two SPLAY applica- 
tions on the same node are unaware of each other (they 
cannot even exchange data via the file system); they can 
only communicate by message passing as for remote
processes. Figure 1 illustrates the deployment of multiple
applications with a host participating in both a Chord DHT
and a BitTorrent swarm.

An important point is that SPLAY applications can be 
run locally with no modification to their code, while 
still using all libraries and language features proposed 
by SPLAY. Users can simply and quickly debug and test 
their programs locally, prior to deployment. 

We now discuss in more detail the different
components of the SPLAY architecture.


Controller. The controller plays an essential role in our 
system. It is implemented as a set of cooperating pro- 
cesses and executes on one or several trusted servers. The 
only central component is a database that stores all data 
pertaining to participating hosts and applications. 

The controller (see Figure 2) keeps track of all active 
SPLAY daemons and applications in the system. Upon 
startup, a daemon initiates a secure connection (SSL) to a
ctl process. For scalability reasons, there can be many
ctl processes spread across several trusted hosts. These
processes only need to access the shared database. 

SPLAY daemons open connections to log processes 
on behalf of the applications, if the logging library is 
used. This library is described in section 3.4. 

The deployment of a distributed application is achieved 
by submitting a job through a command-line or Web- 
based interface. SPLAY also provides a Web services API 
that can be used by other projects. Once registered in 
the database, jobs are handled by jobs processes. The 
nodes participating in the deployment can be specified 
explicitly as a list of hosts, or one can simply indicate 
the number of nodes on which deployment has to take 
place, regardless of their identity. One can also specify 
requirements in terms of resources that must be available 
at the participating nodes (e.g., bandwidth) or in terms 
of geographical location (e.g., nodes in a specific country 
or within a given distance from a position). Incremental 
deployment, i.e., adding nodes at different times, can be 
performed using several jobs or with the churn manager.

Figure 2: Architecture of the SPLAY controller (note that all
components may be distributed on different machines). 











Each daemon is associated with records in the database
that store information about the applications and active
hosts running them, or scheduled for execution. The
controller monitors the daemons and uses a session
mechanism to tolerate short-term disconnections (i.e., a daemon
is considered alive if it shows activity at least once during 
a given time period). Only after a long-term disconnec- 
tion (typically one hour) does the controller reset the sta- 
tus of the daemon and clean up the associated entries in 
the database. This task is under the responsibility of the 
unseen process. The blacklist process manages in 
the database a list of forbidden network addresses and 
masks; it piggybacks updates of this list onto messages 
sent to connected daemons. 
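
The session mechanism amounts to comparing the time of a daemon's last activity against two thresholds. The sketch below is illustrative only and is not SPLAY's controller code: the function names, the table layout, and the 60-second session window are assumptions; only the one-hour reset threshold is taken from the text.

```lua
-- Hypothetical sketch of the controller's session logic (not SPLAY's code).
local SESSION_WINDOW = 60      -- assumed: seconds of silence tolerated
local RESET_AFTER    = 3600    -- from the text: typically one hour

-- Classify a daemon from the time of its last observed activity.
local function daemon_status(last_seen, now)
  local silent = now - last_seen
  if silent <= SESSION_WINDOW then
    return "alive"             -- showed activity within the session window
  elseif silent < RESET_AFTER then
    return "disconnected"      -- short-term disconnection, state is kept
  else
    return "reset"             -- long-term disconnection: clean up entries
  end
end

-- What a cleanup pass over the daemon records might look like.
local function sweep(daemons, now)
  local removed = {}
  for name, d in pairs(daemons) do
    if daemon_status(d.last_seen, now) == "reset" then
      daemons[name] = nil      -- clearing an existing key during pairs() is allowed
      removed[#removed + 1] = name
    end
  end
  table.sort(removed)
  return removed
end
```

Keeping the classification separate from the cleanup pass makes the two thresholds easy to tune independently.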

Communication between the daemon and the con- 
troller follows a simple request/answer protocol. The first 
request originates from the daemon that connects to the 
controller. Every subsequent command comes from the 
controller. For brevity, we only present here a minimal 
set of commands. 

The jobs process dequeues jobs from the database 
and searches for a set of hosts matching the constraints 
specified by the user. The controller sends a REGISTER
message to the daemons of every selected node. In
case the identity of the nodes is not explicitly specified, 
the system selects a set larger than the one originally re- 
quested to account for failed or overloaded nodes. Upon 
accepting the job, a daemon sends to the controller the 
range of ports that are available to the application. Once 
it receives enough replies, the controller first sends to ev- 
ery selected daemon a LIST message with the addresses 
of some participating nodes (e.g., a single rendez-vous
node or a random subset, depending on the application)
to bootstrap the application, followed by a START mes- 
sage to begin execution. Supernumerary daemons that 
are slow to answer and active applications that must be 
terminated receive a FREE message. The state machine 
of a SPLAY job is as follows: 


[State machine diagram: transitions REGISTER, LIST, START; state labels not recoverable.]


The reason why we initially select a larger set of 
nodes than requested clearly appears when considering 
the availability of hosts on testbeds like PlanetLab, where 


transient failures and overloads are the norm rather than 
the exception. Figure 3 shows both the cumulative and 
discretized distributions of round-trip times (RTT) for a 
20 KB message over an already established TCP con- 
nection from the controller to PlanetLab hosts. One can 
observe that only 17.10% of the nodes reply within 250 
milliseconds, and over 45% need more than 1 second.
Selecting a larger set of candidates allows us to choose the
most responsive nodes for deploying the application.

Figure 3: RTT between the controller and PlanetLab hosts over
pre-established TCP connections, with a 20 KB payload. 
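
The over-selection step described above can be sketched as follows. This is a minimal illustration rather than the controller's actual logic: select_nodes, the table of reply times, and the tie-breaking rule are assumptions introduced for the example.

```lua
-- Hypothetical sketch: keep the most responsive daemons among a larger
-- candidate set. 'replies' maps a daemon name to its reply time in seconds;
-- daemons that never answered are simply absent from the table.
local function select_nodes(replies, wanted)
  local answered = {}
  for name, rtt in pairs(replies) do
    answered[#answered + 1] = { name = name, rtt = rtt }
  end
  -- Fastest first; ties are broken by name so the result is deterministic.
  table.sort(answered, function(a, b)
    if a.rtt ~= b.rtt then return a.rtt < b.rtt end
    return a.name < b.name
  end)
  local selected, freed = {}, {}
  for i, d in ipairs(answered) do
    if i <= wanted then
      selected[#selected + 1] = d.name   -- would receive LIST, then START
    else
      freed[#freed + 1] = d.name         -- supernumerary: would receive FREE
    end
  end
  return selected, freed
end

local selected, freed = select_nodes({ n1 = 0.21, n2 = 1.40, n3 = 0.35, n4 = 0.90 }, 2)
print(table.concat(selected, ","), table.concat(freed, ","))
```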


Daemons. SPLAY daemons are installed on participat- 
ing hosts by a local user or administrator. The local ad- 
ministrator can configure the daemon via a configuration 
file, specifying various instance parameters (e.g., daemon 
name, access key, etc.) and restrictions on the resources 
available for SPLAY applications. These restrictions en- 
compass memory, network, and disk usage. If an applica- 
tion exceeds these limitations, it is killed (memory usage) 
or I/O operations fail (disk or network usage). The con- 
troller can specify stricter—but not weaker—restrictions 
at deployment time. 
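
The rule that the controller may tighten but never loosen local restrictions can be expressed as a per-resource minimum. The following sketch assumes numeric limits where smaller means stricter; the function name and the keys max_mem and max_disk are hypothetical and do not come from SPLAY's configuration format.

```lua
-- Hypothetical sketch: the effective limit is the stricter (smaller) value.
local function effective_limits(local_limits, controller_limits)
  local eff = {}
  for key, value in pairs(local_limits) do
    local requested = controller_limits and controller_limits[key]
    if requested and requested < value then
      eff[key] = requested   -- the controller tightened this limit
    else
      eff[key] = value       -- weaker or missing requests are ignored
    end
  end
  return eff
end

local eff = effective_limits({ max_mem = 16, max_disk = 100 },
                             { max_mem = 8,  max_disk = 500 })
print(eff.max_mem, eff.max_disk)
```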

Upon startup, a SPLAY daemon receives a blacklist of 
forbidden addresses expressed as IP or DNS masks. By 
default, the addresses of the controllers are blacklisted so 
that applications cannot actively connect to them. Black- 
lists can be updated by the controller at runtime (e.g., 
when adding a new daemon or for protecting a particu- 
lar machine). 

The daemon also receives the address of a log process 
to connect to for logging, together with a unique identifi- 
cation key. SPLAY applications instantiated by the local 
daemon can only connect to that log process; other pro- 
cesses will reject any connection request. 

3.2 Churn Management 

In order to fully understand the behavior and
robustness of a distributed protocol, it is necessary to
evaluate it under different churn conditions. These
conditions can range from rare but unpredictable hardware
failures, to frequent application-level disconnections, as
usually found in user-driven peer-to-peer systems, or even
to massive failure scenarios. It is also important to
allow comparison of competing algorithms under the very
same churn scenarios. Relying on the natural, non- 
reproducible churn of testbeds such as PlanetLab often 
proves to be insufficient. 

There exist several characterizations of churn that can 
be leveraged to reproduce realistic conditions for the pro- 


tocol under test. First, synthetic descriptions issued from 
analytical studies [27] can be used to generate churn sce- 
narios and replay them in the system. Second, several 
traces of the dynamics of real networks have been made 
publicly available by the community (e.g., see the repos- 
itory at [1]); they cover a wide range of applications 
such as a highly churned file-sharing system [12] or
high-performance computing clusters [32].

Figure 4: Example of a synthetic churn description: script
(left), binned number of joins/leaves (right, bottom) and total
number of nodes (right, top).


SPLAY incorporates a component, churn (see Fig- 
ure 2), dedicated to churn management. This component 
can send instructions to the daemons for stopping and 
starting processes on-the-fly. Churn can be specified as 
a trace, in a format similar to that used by [1], or as a syn- 
thetic description written in a simple script language. The 
trace indicates explicitly when each node enters or leaves 
the system while the script allows users to express phases 
of the application’s lifetime, such as a steady increase or 
decrease of the number of peers over a given time du- 
ration, periods with continuous churn, massive failures, 
join flash crowds, etc. An example script is shown in 
Figure 4 together with a representation of the evolution 
of the node population and the number of arrivals and de- 
partures during each one-minute period: an initial set of 
nodes joins after 30 seconds, then the system stabilizes 
before a regular increase, a period with a constant popu- 
lation but a churn that sees half of the nodes leave and an 
equal number join, a massive failure of half of the nodes, 
another increase under high churn, and finally the depar- 
ture of all the nodes. 
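
To make the trace-based description concrete, the sketch below replays a small trace and computes, for each period, the number of joins and leaves and the resulting population. The trace format shown here (time, node, action) is an assumption made for illustration; it is not the format used by SPLAY or by the repository cited as [1].

```lua
-- Hypothetical trace: one {time_in_seconds, node_id, "join" | "leave"} per
-- event, sorted by time.
local trace = {
  { 30, 1, "join" }, { 30, 2, "join" }, { 30, 3, "join" },
  { 95, 2, "leave" }, { 100, 4, "join" }, { 130, 1, "leave" },
}

-- Bin joins and leaves per period and record the population at the end of
-- each bin.
local function replay(events, bin_seconds, duration)
  local bins, population, idx = {}, 0, 1
  for b = 1, math.ceil(duration / bin_seconds) do
    local bin = { joins = 0, leaves = 0 }
    -- Consume every event that falls before the end of this bin.
    while idx <= #events and events[idx][1] < b * bin_seconds do
      if events[idx][3] == "join" then
        bin.joins = bin.joins + 1
        population = population + 1
      else
        bin.leaves = bin.leaves + 1
        population = population - 1
      end
      idx = idx + 1
    end
    bin.population = population
    bins[b] = bin
  end
  return bins
end

for i, bin in ipairs(replay(trace, 60, 180)) do
  print(i, bin.joins, bin.leaves, bin.population)
end
```

Because the same trace always yields the same bins, two competing algorithms can be compared under identical churn.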

Section 5.5 presents typical uses of the churn
management mechanism in the evaluation of a large-scale
distributed system. It is noteworthy that the churn
management system removes the need for fault injection systems
such as Loki [16]. Another typical use of the churn man- 
agement system is for long-running applications, e.g., a 
DHT that serves as a substrate for some other distributed 
application under test and needs to stay available for the 
whole duration of the experiments. In such a scenario, 
one can ask the churn manager to maintain a fixed-size 
population of nodes and to automatically bootstrap new 
ones as faults occur in the testbed. 

3.3 Language and Applications

SPLAY applications are written in the Lua language [20], 
whose features are extended by SPLAY’s libraries. This 
design choice was dictated by four major factors. First,
and most importantly, Lua has unique features
that make it simple and efficient to implement
sandboxing. As mentioned earlier, sandboxing is a sound basis
for execution in non-dedicated environments, where re- 
sources need to be constrained and where the hosting op- 
erating system must be shielded from possibly buggy or 
ill-behaved code. Second, one of SPLAY’s goals is to 
support large numbers of processes within a single host 
of the testbed. This calls for a low footprint for both 
the daemons and the associated libraries. This excludes 
languages such as Java that require several megabytes 
of memory just for their execution environment. Third, 
SPLAY must ensure that the achieved performance is as 
good as the host system permits, and the features offered 
to the distributed system designer shall not interfere with 
the performance of the application. Fourth, SPLAY allows 
deployment of applications on any hardware and on any
operating system. This requires a “write-once, run
everywhere” approach that calls for either an interpreted or
bytecode-based language. Lua’s unique features allow us 
to meet these goals of lightness, simplicity, performance, 
security and generality. 


Lua was designed from the ground up to be an effi- 
cient scripting language with very low footprint. Accord- 
ing to recent benchmarks [2], Lua is among the fastest 
interpreted scripting languages. It is reflective, impera- 
tive, and procedural with extensible semantics. Lua is dy- 
namically typed and has automatic memory management 
with incremental garbage collection. The small footprint
of Lua results from its design, which provides flexible
and extensible meta-features, rather than a complete set 
of general-purpose facilities. The full interpreter is less 
than 200 kB and can be easily embedded. Applications 
can use libraries written in different languages (especially 
C/C++). This allows for low-level programming if need 
be. Our experiments (Section 5) highlight the lightness 
of SPLAY applications using Lua, in terms of memory 
footprint, load, and scalability. 


Lua’s interpreter can directly execute source code, 
as well as hardware-dependent (but operating system- 
independent) bytecode. In SPLAY, the favored way of 
submitting applications is in the form of source code, but 
bytecode programs are also supported (e.g., for intellec- 
tual property protection). 


Isolation and sandboxing are achieved thanks to Lua’s 
support for first-class functions with lexical scoping and 
closures, which allow us to restrict access to I/O and net- 
working libraries. We modify the behavior of these func- 
tions to implement the restrictions imposed by the admin- 
istrator or by the user at the time he/she submits the ap- 
plication for deployment over SPLAY. 
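
The following sketch illustrates how a closure can restrict an I/O function, in the spirit of the mechanism described above. It is not SPLAY's wrapper: restrict_writer, the byte quota, and the error message are assumptions made for the example.

```lua
-- Hypothetical sketch of closure-based restriction.
-- 'raw_write' stands for any unrestricted output function.
local function restrict_writer(raw_write, max_bytes)
  local written = 0                   -- private state, reachable only via the closure
  return function(data)
    if written + #data > max_bytes then
      return nil, "quota exceeded"    -- fail the operation instead of performing it
    end
    written = written + #data
    return raw_write(data)
  end
end

-- Usage: the application only ever sees 'write', never the raw function.
local sink = {}
local write = restrict_writer(function(s) sink[#sink + 1] = s; return true end, 10)
```

Since the counter lives in an upvalue, application code holding only the returned function has no way to reset or bypass the quota.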


Lua also supports cooperative multitasking by
means of coroutines, which are at the core of SPLAY’s
event-based model (discussed below).

Figure 5: Overview of the main SPLAY libraries.


3.4 The Libraries 


SPLAY includes an extensible set of shared libraries (see 
Figure 5) tailored for the development of distributed
applications and overlays. These libraries are also meant to
be used outside of the deployment system, when
developing the application. We briefly describe the major
components of these libraries. 


Networking. The luasocket library provides basic 
networking facilities. We have wrapped it into a restricted 
socket library, sb_socket, which includes a security 
layer that can be controlled by the local administrator (the 
person who has instantiated the local daemon process) 
and further restricted remotely by the controller. This se- 
cure layer allows us to limit: (1) the total bandwidth avail- 
able for SPLAY applications (instantaneous bandwidth 
can be limited using shaping tools if need be); (2) the 
maximum number of sockets used by an application and 
(3) the addresses that an application can or cannot con- 
nect to. Restrictions are specified declaratively in con- 
figuration files by the local user that starts the daemon, 
or at the controller via the command-line and Web-based 
APIs. 

We have implemented higher-level abstractions for 
simplifying communication between remote processes. 
Our API supports message passing over TCP and UDP, 
as well as access to remote functions and variables
using RPCs. Calling a remote function is almost as
simple as calling a local one (see code in next section). All
arguments and return values are transparently serialized. 
Communication errors are reported using a second return 
value, as allowed by Lua. 

Finally, communication libraries can be instructed to 
drop a given proportion of the packets (specified upon 
deployment): this can be used to simulate lossy links and 
study their impact on an application. 
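
A lossy link of this kind can be modeled by wrapping a send function. The sketch below is an illustration under stated assumptions: lossy_sender and its injectable random source are hypothetical names and do not reflect how SPLAY's communication libraries implement packet dropping.

```lua
-- Hypothetical wrapper: drop a given proportion of messages before sending.
-- 'rng' must return a number in [0, 1); it defaults to math.random.
local function lossy_sender(send, drop_ratio, rng)
  rng = rng or math.random
  return function(msg)
    if rng() < drop_ratio then
      return false          -- message silently dropped
    end
    send(msg)
    return true
  end
end

-- Usage with roughly 30% loss.
local delivered = 0
local send = lossy_sender(function(msg) delivered = delivered + 1 end, 0.3)
for i = 1, 1000 do send("ping " .. i) end
print("delivered", delivered)
```

Passing the random source as a parameter keeps experiments reproducible, since a recorded sequence can be substituted for math.random.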


Sandboxed virtual filesystem. Overlays and dis- 
tributed applications often need to use the local file sys- 
tem. For instance, when instantiating the BitTorrent pro- 
tocol to replicate a large file on a set of nodes, temporary 
data must be written to disk as chunks are being received. 
Following our goal to not impact the hosting operating 
system, we need to ensure that a SPLAY application can- 
not access or overwrite any data on the host file system. 
To this end, SPLAY includes a library, sb_fs, that wraps 
the standard io library and provides restricted access to 
the file system in an OS-independent fashion. 

Our wrapped library simulates a file system inside a 
single directory. The library transparently maps a
complete path name to the underlying file that stores the
actual data, and applications can only read the files located
in their private directory. The wrapped file handles en- 
force additional restrictions, such as limitations on the 
disk space and the number of opened files. 
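
One way to map complete path names onto files in a single directory is to escape every character that could act as a separator. The sketch below shows such a mapping; the escaping scheme and the name to_host_path are assumptions for illustration, and the text does not specify the scheme actually used by sb_fs.

```lua
-- Hypothetical mapping: every virtual path becomes one flat file in 'root'.
local function to_host_path(root, virtual_path)
  -- Escape every non-alphanumeric byte (including "/", "." and "_") as _XX,
  -- so the flat name cannot contain a path separator or a ".." component.
  local flat = virtual_path:gsub("[^%w]", function(c)
    return string.format("_%02X", string.byte(c))
  end)
  return root .. "/" .. flat
end

print(to_host_path("/tmp/job42", "data/chunk.1"))
print(to_host_path("/tmp/job42", "../etc/passwd"))
```

Escaping the underscore itself keeps the mapping unambiguous: two distinct virtual paths never collide on the same host file.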


Events, threads and locks. SPLAY proposes a thread- 
ing model based on Lua’s coroutines combined with 
event-based programming. Unlike preemptive threads, 
coroutines yield the processor to each other (cooperative 
multitasking). This happens at special points in base li- 
braries, typically when performing an operation that may 
block (e.g., disk or network I/O). This is typically trans- 
parent to the application developer. Although a single 
SPLAY application will not benefit from a multicore pro- 
cessor, coroutines are preferable to system-level threads 
for two reasons: their portability and their recognized ef- 
ficiency (low latency and high throughput) for programs 
that use many network connections (using either non- 
blocking or RPC-based programming), which is typical 
of distributed systems programming. Moreover, using a 
single process (at the operating system level) has a lower 
footprint, especially from a sandboxing perspective, and 
allows deploying more applications on each splayd. 

Shared data accesses are also safer with coroutines, as 
race conditions can only occur if the current thread yields 
the processor. This requires, however, a good understand- 
ing of the behavior of the application (we illustrate a com- 
mon pitfall in Section 4). SPLAY provides a lock library 
as a simple alternative to protect shared data from con- 
current accesses by multiple coroutines. 
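
The sketch below shows why a yield between a read and a write of shared state can lose an update, and how a simple lock prevents it. It uses only standard Lua coroutines; the flag-based lock and the round-robin scheduler are simplified stand-ins written for this example, not SPLAY's lock or events libraries.

```lua
-- Two cooperative tasks update a shared counter; the yield models a
-- blocking operation during which another coroutine may run.
local function run(use_lock)
  local counter, locked = 0, false
  local function task()
    if use_lock then
      while locked do coroutine.yield() end  -- wait until the lock is free
      locked = true
    end
    local seen = counter          -- read shared state
    coroutine.yield()             -- "blocking" call: other coroutines run here
    counter = seen + 1            -- write back a possibly stale value
    locked = false
  end
  local tasks = { coroutine.create(task), coroutine.create(task) }
  -- Round-robin scheduler: resume every live task until all have finished.
  local alive = #tasks
  while alive > 0 do
    alive = 0
    for _, co in ipairs(tasks) do
      if coroutine.status(co) ~= "dead" then
        assert(coroutine.resume(co))
        if coroutine.status(co) ~= "dead" then alive = alive + 1 end
      end
    end
  end
  return counter
end

print("without lock:", run(false))  -- one increment is lost
print("with lock:", run(true))      -- both increments are applied
```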

We have also developed an event library, events, that
controls the main execution loop of the application, the 
scheduler, the communication between coroutines, time- 
outs, as well as event generation, waiting, and recep- 
tion. To integrate with the event library, we have wrapped 
the socket library to produce a non-blocking, coroutine- 
aware version sb_socket. All these layers are trans- 
parent to the SPLAY developer who only sees a restricted, 
non-blocking socket library. 


Logging. An important objective of SPLAY is to be able
to quickly prototype and experiment with distributed al- 
gorithms. To that end, one must be able to easily debug 
and collect statistics about the SPLAY application at run- 
time. The log library allows the developer to print infor- 
mation either locally (screen, file) or, more interestingly, 
send it over the network to a log collector managed by the 
controller. If need be, the amount of data sent to the log 
collector can be restricted by a splayd, as instructed by
the controller. As with most log libraries, facilities are 
provided to manage different log levels and dynamically 
enable or disable logging. 


Other libraries. SPLAY provides a few other libraries 
with facilities useful for developing distributed systems 
and applications. The llenc and json libraries [3]
support automatic and efficient serialization of data to be sent


to remote nodes over the network. We developed the first 
one, llenc, to simplify message passing over stream- 
oriented protocols (e.g., TCP). The library automatically 
performs message demarcation, computing buffer sizes 
and waiting for all packets of a message before deliv- 
ery. It uses the json library to automate encoding of any 
type of data structures using a compact and standardized 
data-interchange format. The crypto library includes 
cryptographic functions for data encryption and decryp- 
tion, secure hashing, signatures, etc. The misc library 
provides common containers, functions for format con- 
version, bit manipulation, high-precision timers and dis- 
tributed synchronization. 
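
Message demarcation over a stream can be illustrated with a length-prefixed framing. The sketch below is a generic example of the technique; the wire format (a decimal length, a newline, then the payload) and the function names are assumptions and should not be read as llenc's actual format.

```lua
-- Hypothetical framing: "<decimal length>\n<payload>".
local function frame(payload)
  return #payload .. "\n" .. payload
end

-- Returns a feed(chunk) function that buffers stream data and returns the
-- list of messages completed by that chunk.
local function new_decoder()
  local buffer = ""
  return function(chunk)
    buffer = buffer .. chunk
    local messages = {}
    while true do
      local nl = buffer:find("\n", 1, true)
      if not nl then break end                    -- length header incomplete
      local len = tonumber(buffer:sub(1, nl - 1))
      if not len then error("malformed length header") end
      if #buffer - nl < len then break end        -- payload incomplete
      messages[#messages + 1] = buffer:sub(nl + 1, nl + len)
      buffer = buffer:sub(nl + len + 1)
    end
    return messages
  end
end

-- Usage: a message split across two TCP segments is delivered only once
-- all of its bytes have arrived.
local feed = new_decoder()
local stream = frame("hello") .. frame("world")
print(#feed(stream:sub(1, 4)), #feed(stream:sub(5)))
```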

The memory footprint of these libraries is remarkably 
small. The base size of a SPLAY application is less than 
600 kB with all the abovementioned libraries loaded. It 
is easy for administrators to deploy additional third-party 
software with the daemons, in the form of libraries. Lua 
has been designed to seamlessly interact with C/C++, and
other languages that bind to C can be used as well. For
instance, we successfully linked some SPLAY application
code with a third-party video transcoding library in C, for 
experimenting with adaptive video multicast. Obviously, 
the administrator is responsible for providing sandboxing 
in these libraries if required. 


4 Developing Applications with SPLAY 

This section illustrates the development of an application 
for SPLAY. We use the well-known Chord overlay [33] 
for its familiarity to the community. As we will see, 
the specification of this overlay is remarkably concise 
and close to the pseudo-code found in the original paper. 
We have successfully deployed this implementation on 
a ModelNet cluster and PlanetLab; results are presented 
in Section 5.2. The goal here is to provide the reader 
with a complete chain of development, deployment, and 
monitoring of a well-known distributed application. Note 
that local testing and debugging is generally done outside 
of the deployment framework (but still, using SPLAY li- 
braries). 

Chord is a distributed hash table (DHT) that maps keys 
to nodes in a peer-to-peer infrastructure. Any node can 
use the DHT substrate to determine the current live node 
that is responsible for a given key. When joining the net- 
work, a node receives a unique identifier (typically by 
hashing its IP address and port number) that determines 
its position in the identifier space. Nodes are organized 
in a ring according to their identifiers, and every node 
is responsible for the keys that fall between itself (inclu- 
sive) and its predecessor (exclusive). In addition to keep- 
ing track of their successors and predecessors on the ring, 
each node maintains a “finger” table whose entries point 
to nodes at an exponentially increasing distance from the 
current node’s position. More precisely, the $i^{th}$ entry of a
node with identifier $n$ designates the live node
responsible for key $n + 2^{i-1}$. Note that the successor is effectively
the first entry in the finger table.

function join(n0) -- n0: some node in the ring
  predecessor = nil
  finger[1] = call(n0, {'find_successor', n.id})
  call(finger[1], {'notify', n})
end
function stabilize() -- periodically verify n's successor
  local x = call(finger[1], 'predecessor')
  if x and between(x.id, n.id, finger[1].id, false, false) then
    finger[1] = x -- new successor
  end
  call(finger[1], {'notify', n})
end
function notify(n0) -- n0 thinks it might be our predecessor
  if not predecessor or between(n0.id, predecessor.id, n.id, false, false) then
    predecessor = n0 -- new predecessor
  end
end
function fix_fingers() -- refresh fingers
  refresh = (refresh % m) + 1 -- 1 <= refresh <= m
  finger[refresh] = find_successor((n.id + 2^(refresh - 1)) % 2^m)
end
function check_predecessor() -- checks if predecessor has failed
  if predecessor and not ping(predecessor) then
    predecessor = nil
  end
end


Listing 1: SPLAY code for Chord overlay (stabilization). 


Listing 1 shows the code for the construction and main- 
tenance of the Chord overlay. For clarity, we only show 
here the basic algorithm that was proposed in [33] (the 
reader can appreciate the similarity between this code and 
Figure 6 of the referenced paper). 

Function join() allows a node to join the Chord
ring. Only its successor is set: its predecessor and
successor’s predecessor will be updated as part of the
stabilization process. Function stabilize() periodically
verifies that a node is its own successor’s
predecessor and notifies the successor. The SPLAY base
library’s between call determines the inclusion of a
value in a range, on a ring. Function notify() tells
a node that its predecessor might be incorrect.
Function fix_fingers() iteratively refreshes fingers.
Finally, function check_predecessor() periodically
checks if a node’s predecessor has failed. 
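
The inclusion test on a ring must handle intervals that wrap past the largest identifier. The sketch below is a plain-Lua illustration whose argument order mirrors the calls in Listing 1; it is not the implementation of misc.between_c, and its treatment of equal bounds (the interval spans the whole ring) is an assumption.

```lua
-- Hypothetical ring-interval test: is x between a and b when walking
-- clockwise from a? incl_a and incl_b say whether each bound is included.
local function between(x, a, b, incl_a, incl_b)
  if a == b then                  -- equal bounds: the interval spans the ring
    return x ~= a or incl_a or incl_b
  end
  if x == a then return incl_a end
  if x == b then return incl_b end
  if a < b then
    return a < x and x < b        -- interval does not wrap
  end
  return x > a or x < b           -- interval wraps past the highest identifier
end

print(between(5, 3, 8, false, false))   -- inside a non-wrapping interval
print(between(1, 60, 4, false, false))  -- inside a wrapping interval
print(between(30, 60, 4, false, false)) -- outside a wrapping interval
```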

These functions are identical in their behavior and very 
similar in their form to those published in [33]. Yet, 
they correspond to executable code that can be readily 
deployed. The implementation of Chord illustrates a sub- 
tle problem that occurs frequently when developing dis- 
tributed applications from a high-level pseudo-code de- 
scription: the reception of multiple messages may trigger 
concurrent operations that perform conflicting modifica- 
tions on the state of the node. SPLAY’s coroutine model 
alleviates this problem in some, but not all, situations. 
During the blocking call to ping() in check_predecessor()
(Listing 1), a remote call to notify() can update the
predecessor, which the following assignment (predecessor = nil)
may then erase until the next remote call to notify(). This is
not a major issue as it may only delay stabilization, not break
consistency. It can be avoided by adding an extra check after the ping
or, more generally, by using the locks provided by the


SPLAY standard libraries (not shown here). 
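
The pitfall and the extra check can be reproduced in a few lines of plain Lua. In the simulation below, coroutine.yield() stands in for the blocking RPC and the update of predecessor between the two resumes stands in for a remote notify(); these stand-ins, and the simulate function itself, are assumptions made for illustration.

```lua
-- Simulation of the pitfall: ping() yields, and notify() runs meanwhile.
local function simulate(recheck)
  local predecessor = { id = 10 }     -- stale predecessor that has failed
  local function ping(node)
    coroutine.yield()                 -- blocking RPC: other handlers may run
    return false                      -- the old predecessor does not answer
  end
  local function check_predecessor()
    local checked = predecessor       -- remember which node was pinged
    if checked and not ping(checked) then
      -- Extra check: erase the predecessor only if it is still that node.
      if not recheck or predecessor == checked then
        predecessor = nil
      end
    end
  end
  local co = coroutine.create(check_predecessor)
  assert(coroutine.resume(co))        -- runs until ping() blocks
  predecessor = { id = 12 }           -- a remote notify() installs a live node
  assert(coroutine.resume(co))        -- ping() returns, the check completes
  return predecessor
end

print(simulate(false))      -- without the check, the new predecessor is erased
print(simulate(true).id)    -- with the check, it is preserved
```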


function find_successor(id) -- ask node to find id's successor
  if between(id, n.id, finger[1].id, false, true) then -- inclusive for second bound
    return finger[1]
  end
  local n0 = closest_preceding_node(id)
  return call(n0, {'find_successor', id})
end
function closest_preceding_node(id) -- finger preceding id
  for i = m, 1, -1 do
    if finger[i] and between(finger[i].id, n.id, id, false, false) then
      return finger[i]
    end
  end
  return n
end


Listing 2: SPLAY code for Chord overlay (lookup). 


Listing 2 shows the code for Chord lookup.
Function find_successor() looks for the
successor of a given identifier, while function
closest_preceding_node() returns the highest
predecessor of a given identifier found in the finger table.
Again, one can appreciate the similarity with the original 
pseudo-code. 
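
The lookup logic can be exercised without any network by modeling the ring in a single process. In the sketch below, RPCs are replaced by direct table accesses, finger tables are built by brute force rather than by stabilization, and the identifier length (m = 6) and node identifiers are arbitrary choices for the example; none of this is SPLAY code.

```lua
-- Local, single-process model of Chord lookup (assumed parameters throughout).
local m = 6                                   -- identifier length: ring of 2^6
local ring_size = 2 ^ m
local ids = { 1, 8, 14, 21, 32, 38, 42, 48, 51, 56 }  -- sorted node identifiers

local function between(x, a, b, incl_a, incl_b)
  if a == b then return x ~= a or incl_a or incl_b end
  if x == a then return incl_a end
  if x == b then return incl_b end
  if a < b then return a < x and x < b end
  return x > a or x < b
end

-- First node whose identifier is >= key, wrapping around the ring.
local function successor_of(key)
  for _, id in ipairs(ids) do
    if id >= key then return id end
  end
  return ids[1]
end

-- Brute-force finger tables: finger[i] = successor(n + 2^(i-1) mod 2^m).
local nodes = {}
for _, id in ipairs(ids) do
  local node = { id = id, finger = {} }
  for i = 1, m do
    node.finger[i] = successor_of((id + 2 ^ (i - 1)) % ring_size)
  end
  nodes[id] = node
end

local function closest_preceding_node(n, id)
  for i = m, 1, -1 do
    local f = n.finger[i]
    if between(f, n.id, id, false, false) then return nodes[f] end
  end
  return n
end

-- Returns the successor's identifier and the number of forwarding hops.
local function find_successor(n, id, hops)
  hops = hops or 0
  if between(id, n.id, n.finger[1], false, true) then
    return n.finger[1], hops
  end
  return find_successor(closest_preceding_node(n, id), id, hops + 1)
end

print(find_successor(nodes[8], 54))
```

With ten nodes, a lookup started at node 8 for key 54 is forwarded twice before reaching the node whose successor owns the key.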

This almost completes our minimal Chord implemen- 
tation, with the exception of the initialization code shown 
in Listing 3. One can specifically note the registration of 
periodic stabilization tasks and the invocation of the main 
event loop. 


require "splay.base" -- events, misc, socket (core libraries)
rpc = require "splay.rpc" -- rpc (optional library)
between, call, ping = misc.between_c, rpc.call, rpc.ping -- aliases
-- (the functions of Listings 1 and 2 go here)
timeout = 5 -- stabilization frequency
m = 24 -- 2^m nodes and keys with identifiers of length m
n = job.me -- our node {ip, port, id}
n.id = math.random(1, 2^m) -- random position on ring
predecessor = nil -- previous node on ring {id, ip, port}
finger = {[1] = n} -- finger table with m entries
refresh = 0 -- next finger to refresh
n0 = job.nodes[1] -- first peer is rendez-vous node
rpc.server(n.port) -- start rpc server
events.thread(function() join(n0) end) -- join chord ring
events.periodic(stabilize, timeout) -- periodically check successor, ...
events.periodic(check_predecessor, timeout) -- predecessor, ...
events.periodic(fix_fingers, timeout) -- and fingers
events.loop() -- execute main loop


Listing 3: SPLAY code for Chord overlay (initialization). 


While this code is quite classical in its form, the re- 
markable features are the conciseness of the implemen- 
tation, the closeness to pseudo-code, and the ease with 
which one can communicate with other nodes of the sys- 
tem by RPC. Of course, most of the complexity is hidden 
inside the SPLAY infrastructure. 

The presented implementation is not fault-tolerant. Although
the goal of this paper is not to present the design
of a fault-tolerant Chord, we briefly elaborate below on
some steps needed to make Chord robust enough for running
on error-prone platforms such as PlanetLab. The
first step is to take into account the absence of a reply to
an RPC. Consider the call to predecessor in method
stabilize(). One simply needs to replace this call
by the code of Listing 4.

function stabilize() -- rpc.a_call() returns both status and results
  local ok, x = rpc.a_call(finger[1], 'predecessor', 60) -- RPC, 1m timeout
  if not ok then
    suspect(finger[1]) -- will prune the node out of local routing tables
  else
    (...)


Listing 4: Fault-tolerant RPC call 


We omit the code of function suspect() for brevity.
Depending on the reliability of the links, this function
prunes the suspected node after a configurable number
of missed replies. One can tune the RPC timeout according
to the target platform (here, 1 minute instead of the
standard 2 minutes), or use an adaptive strategy (e.g.,
exponentially increasing timeouts). Finally, as suggested
by [33] and similarly to the leafset structure used in Pastry [31],
we replace the single successor and predecessor
by a list of 4 peers in each direction on the ring.
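
Since the code of suspect() is omitted, the following Python sketch illustrates one way the two ideas above could fit together: pruning after a configurable number of missed replies, and exponentially increasing timeouts. The class name `FailureDetector`, its parameters, and the cap on the timeout are assumptions made for illustration, not the SPLAY implementation.

```python
class FailureDetector:
    """Counts missed replies per peer and reports when to prune a peer."""

    def __init__(self, max_missed=3, base_timeout=60.0, max_timeout=480.0):
        self.max_missed = max_missed      # hypothetical tunable threshold
        self.base_timeout = base_timeout  # seconds; 60 s as in Listing 4
        self.max_timeout = max_timeout    # assumed cap for the adaptive strategy
        self.missed = {}                  # peer -> consecutive missed replies

    def timeout_for(self, peer):
        """Exponentially increasing timeout: base * 2**missed, capped."""
        n = self.missed.get(peer, 0)
        return min(self.base_timeout * 2 ** n, self.max_timeout)

    def suspect(self, peer):
        """Record a missed reply; return True if the peer should be pruned."""
        self.missed[peer] = self.missed.get(peer, 0) + 1
        return self.missed[peer] >= self.max_missed

    def confirm(self, peer):
        """A reply arrived: clear any suspicion on this peer."""
        self.missed.pop(peer, None)

fd = FailureDetector()
peer = ("10.0.0.7", 20000)  # made-up address for the demo
for attempt in range(3):
    # Prints the attempt number, the timeout to use, and the prune decision.
    print(attempt, fd.timeout_for(peer), fd.suspect(peer))
```

A caller would use `timeout_for(peer)` as the RPC timeout, call `suspect(peer)` when no reply arrives, and remove the peer from its routing tables once `suspect` returns True.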

Our Chord implementation without fault-tolerance is 
only 58 lines long, which represents an increase of 18% 
over the pseudo-code from the original paper (which does 
not contain initialization code, while our code does). Our 
fault-tolerant version is only 100 lines long, i.e., 73%
more than the base implementation (29% for fault tol- 
erance, and 44% for the leafset-like structure). We de- 
tail the procedure for deployment and the results obtained 
with both versions on a ModelNet cluster and on Planet- 
Lab, respectively, in Section 5.2. 


5 Evaluation 


This section presents a thorough evaluation of SPLAY 
performance and capabilities. Evaluating such an infras- 
tructure is a challenging task as the way users will use 
it plays an important role. Therefore, our goal in this 
evaluation is twofold: (1) to present the implementation, 
deployment and observation of real distributed systems 
by using SPLAY’s capability to easily reproduce experi- 
ments that are commonly used in evaluations and (2) to 
study the performance of SPLAY itself, both by compar- 
ing it to other widely-used implementations and by eval- 
uating its costs and scalability. The overall objective is to 
demonstrate the usefulness and benefits of SPLAY rather 
than evaluate the distributed applications themselves. We 
first demonstrate in Section 5.1 SPLAY’s capability to
express complex systems in a concise manner. We
present in Section 5.2 the deployment and performance
evaluation of the Chord DHT proposed in Section 4, using
a ModelNet [35] cluster and PlanetLab [11]. We then
compare in Section 5.3 the performance and scalability of
the Pastry [31] DHT written with SPLAY against a legacy
Java implementation, FreePastry [4]. Sections 5.4 and 5.5
evaluate SPLAY’s ability to easily (1) deploy applications
in complex network settings (mixed PlanetLab and ModelNet
deployment) and (2) reproduce arbitrary churn conditions.
Section 5.6 focuses on SPLAY performance for
deploying and undeploying applications on a testbed. We
conclude in Section 5.7 with an evaluation of SPLAY’s
performance with resource-intensive applications (tree-based
content dissemination and long-term running of a
cooperative Web cache).

Experimental setup. Unless specified otherwise, our experiments
were performed either on PlanetLab, using
a set of 400 to 450 hosts, or on our local cluster (11
nodes, each equipped with a 2.13 GHz Core 2 Duo processor
and 2 GB of memory, linked by a 1 Gbps switched
network). All nodes run GNU/Linux 2.6.9. A separate
node running FreeBSD 4.11 is used as a ModelNet router,
when required by the experiment. Our ModelNet configuration
emulates 1,100 hosts connected to a 500-node
transit-stub topology. The bandwidth is set to 10 Mbps for
all links. RTT between nodes of the same domain is 10
ms, stub-stub and stub-transit RTT is 30 ms, and transit-transit
(i.e., long-range links) RTT is 100 ms. These settings
result in delays that are approximately twice those
experienced in PlanetLab.


5.1 Development complexity 

We developed the following applications using SPLAY: 
Chord [33] and Pastry [31], two DHTs; Scribe [15], a 
publish-subscribe system; SplitStream [14], a bandwidth- 
intensive multicast protocol; a cooperative web-cache 
based on Pastry; BitTorrent [17], a content distribution 
infrastructure;¹ and Cyclon [36], a gossip-based membership
management protocol. We have also implemented
a number of classical algorithms, such as epidemic diffusion
on Erdős-Rényi random graphs [19] and various
types of distribution trees [13] (m-ary trees, parallel
trees). As one can note from the following figure, all
implementations are extremely concise in terms of lines 
of code (LOC). Note that we did not try to compact the 
code in a way that would impair readability. Numbers and 
darker bars represent LOC for the protocol, while lighter 
bars represent protocols acting as a substrate (Scribe and 
our Web cache are based on Pastry, SplitStream is based 
on both Pastry and Scribe): 








Protocol      Substrate        LOC
Chord                          58 (base) + 17 (FT) + 26 (leafset) = 100
Pastry
Scribe        Pastry           79
SplitStream   Pastry, Scribe   58
WebCache      Pastry           85
BitTorrent                     420
Cyclon                         93
Epidemic                       35
Trees                          47


Although the number of lines is clearly just a rough 
indicator of the expressiveness of a system, it is still a 
valuable metric to estimate programming efforts. Our 
implementations are systematically more compact than 
those written with Mace [23] (by approximately a factor 
of two) and comparable to P2’s [26] specifications. A 
well-documented protocol such as Chord only took a few 
hours to implement and debug. In contrast, BitTorrent,
being a complex and underspecified protocol, required
several days of development. In both cases, the development
process greatly benefited from the short deployment
and testing phase, made almost trivial by SPLAY.

Figure 6: Performance results of Chord, deployed on a ModelNet cluster and on PlanetLab.


5.2 Testing the Chord Implementation 

This section presents the deployment and performance re- 
sults of the Chord implementation from Section 4. We 
proceed with two deployments. First, the exact code pre- 
sented in this paper is deployed in a ModelNet testbed 
with no node failure. Second, a slightly modified version 
of this code is run on PlanetLab. This version includes the 
extensions presented at the end of Section 4: use of a leaf 
set instead of a single successor and a single predecessor, 
fault-tolerant RPCs, and shorter stabilization intervals. 


Chord on ModelNet. To parameterize the deployment 
of the Chord implementation presented in Section 4 on 
a testbed, we create a descriptor that describes resource
requirements and limitations. The descriptor allows one to
further restrict memory, disk and network usage, and it
specifies what information an application should receive 
when instantiated: 


--[[ BEGIN SPLAY RESOURCES RESERVATION
nb_splayd 1000
nodes head 1
END SPLAY RESOURCES RESERVATION ]]


This descriptor requests 1,000 instances of the appli- 
cation and specifies that each instance will receive three 
essential pieces of information: (1) a single-element list 
containing the first node in the deployment sequence (to 
act as rendezvous node); (2) the rank of the current pro- 
cess in the deployment sequence; and (3) the identity of 
the current process (host and port). This information is 
useful to bootstrap the system without having to rely on 
external mechanisms such as a directory service. In the 
case of Chord, we use this information to have hosts join 
the network one after the other, with a delay between consecutive
joins to ensure that a single ring is created. A
staggered join strategy makes experiments easier to
reproduce, but a massive join scenario would succeed as
well. The following code is added:

events.sleep(job.position) -- 1s between joins
if job.position > 1 then -- first node is rendez-vous node
  join(job.nodes[1])
end


Finally, we register the Lua script and the deployment 
descriptor using one of the command line, Web service or 
Web-based interfaces. 

Each host runs 27 to 91 Chord nodes (we show in Sec- 
tion 5.3 that SPLAY can handle many more instances on a 
single host). During the experiment, each node injects 50 
random lookup requests in the system. We then undeploy 
the overlay, and process the results obtained from the logging
facility. Figure 6(a) presents the distribution of route
lengths. Figure 6(b) presents the cumulative distribution
of latencies. The average number of hops is below 82
and the lookup time remains small. This supports our
observation that SPLAY is efficient and does not introduce
additional delays or overheads.


Chord on PlanetLab. Next, we deploy our Chord im- 
plementation with extensions on 380 PlanetLab nodes 
and compare its performance with MIT’s fine-tuned C++ 
Chord implementation [5] in terms of delays when looking
up random keys in the DHT. In both cases, we let
the Chord overlay stabilize before starting the measurements.
Figure 6(c) presents the cumulative distribution of
delays for 5000 random lookups (average route length is 
4.1 for both systems). We observe that MIT Chord out- 
performs Chord for SPLAY, because it relies on a cus- 
tom network layer that uses, amongst other optimiza- 
tions, network coordinates for constructing latency-aware 
finger tables. In contrast, we did not include such opti- 
mizations in our implementation. 
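
As a sanity check on the reported route lengths, one can compare them with the well-known rule of thumb from the original Chord analysis, which puts the average lookup path length at about half of log2(N) hops in a stabilized ring of N nodes. The short Python sketch below computes that estimate; the function name is hypothetical and the formula is an approximation, not a measurement.

```python
import math

def expected_chord_hops(n_nodes):
    """Rule-of-thumb average lookup path length in Chord: (1/2) * log2(N)."""
    return 0.5 * math.log2(n_nodes)

# Network sizes used in the two deployments described in this section.
for n in (380, 1000):
    print(n, round(expected_chord_hops(n), 2))
```

For 380 nodes the estimate is roughly 4.3 hops, which is in line with the average route length of 4.1 reported for both implementations on PlanetLab.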


5.3 SPLAY Performance


We evaluate the performance of applications using 
SPLAY in two ways. First, we evaluate the efficiency of 
the network libraries, based on the delays experienced by 
a sample application on a high-performance testbed. Sec- 
ond, we evaluate scalability: how many nodes can be run 
on a single host and what is the impact on performance. 
For these tests we chose Pastry [31] because: (i) it combines
both TCP and UDP communications; (ii) it requires
efficient network libraries and transport layers, each node
potentially opening sockets and sending data to
a large number of other peers; (iii) it supports network
proximity-based peer selection, and as such can be affected
by fluctuating or unstable delays (for instance due
to overload or scheduling issues).

We compare our version of Pastry with FreePastry 
2.0 [4], a complete implementation of the Pastry proto- 
col in Java. Our implementation is functionally identi- 
cal to FreePastry and uses the very same protocols, e.g., 
locality-aware routing table construction and stabilization 
mechanisms to repair broken routing table entries. The 
only notable differences reside in the message formats 
(no wire compatibility) and the choice of alternate routes 
upon failure. 

We deployed FreePastry using all optimizations advised
by the authors, that is, running multiple nodes
within the same JVM, replacing Java serialization with
raw serialization, and keeping a pool of opened TCP connections
to peers to avoid reopening recently used connections.
We used 3 JVMs on our dual-core machines,
each running multiple Pastry nodes. With a large set of
nodes, our experiments have shown that this configuration
yields slightly better results than using a single JVM,
both in terms of delay and load.

Figure 7: Comparisons of two implementations of Pastry: FreePastry and Pastry for SPLAY.


Figure 7(a) presents the cumulative delay distribution 
in a converged Pastry ring. The distribution of route 
lengths (not shown) is slightly better with FreePastry 
thanks to optimizations in the routing table management. 
Delays obtained with Pastry on SPLAY are much lower 
than the delays obtained with FreePastry. This experi- 
ment shows that SPLAY, while allowing for concise and 
readable protocol implementations, does not trade simplicity
for efficiency. We also notice that Java-based programs
are often too heavyweight to be used with multiple
instances on a single host.² This is further conveyed
by our second experiment that compares the evolution of 
delays of FreePastry (Figure 7(b)) and Pastry for SPLAY 
(Figure 7(c)) as the number of nodes on the testbed in- 
creases. We use a percentile-based plotting method that 
allows expressing the evolution of a cumulative distribu- 
tion of delays with respect to the number of nodes. We 
can observe that: (1) delays start increasing exponentially 
for FreePastry when there are more than 1,600 nodes run- 
ning in the cluster, that is 145 nodes per host (recall that 
all nodes on a single host are hosted by only 3 JVMs and 
share most of their memory footprint); (2) it is not possi- 
ble to run more than 1,980 FreePastry nodes, as the sys- 
tem will start swapping, degrading performance dramat- 
ically; (3) SPLAY can handle 5,500 nodes (500 on each 
host) without significant drop in performance (other than
the O(log N) route lengths evolution, N being the number
of nodes).
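
The percentile-based plotting method mentioned above can be illustrated with a short Python sketch: for each network size, a few percentiles of the measured delays are extracted, and each percentile then forms one curve as a function of the number of nodes. The function names, the chosen percentiles, and the sample delays are assumptions for illustration; the nearest-rank definition of a percentile is one of several in common use.

```python
def percentile(samples, p):
    """Nearest-rank percentile (integer p in 0..100) of a non-empty list."""
    ordered = sorted(samples)
    rank = max(1, -(-p * len(ordered) // 100))  # ceil(p * n / 100)
    return ordered[rank - 1]

def percentile_curves(delays_by_size, percentiles=(5, 25, 50, 75, 90)):
    """Map each percentile to a list of (node_count, delay) points."""
    curves = {p: [] for p in percentiles}
    for size in sorted(delays_by_size):
        for p in percentiles:
            curves[p].append((size, percentile(delays_by_size[size], p)))
    return curves

# Synthetic delays in milliseconds, keyed by number of nodes (not real data).
demo = {100: [12, 15, 11, 30], 200: [14, 18, 13, 45]}
curves = percentile_curves(demo, percentiles=(50, 75))
print(curves[50])
print(curves[75])
```

Each list in `curves` can be drawn as one line, so a widening gap between the median and the upper percentiles shows the tail of the delay distribution growing with the number of nodes.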


Figure 8 presents the load (i.e., average number of processes
with “runnable” status, as reported by the Linux
scheduler) and memory consumption per instance for
a varying number of instances. Each process is a Pastry
node and issues a random request every minute. We observe
that the memory footprint of an instance is lower
than 1.5 MB, with just a slight increase during the experiment
as nodes fill their routing table. It takes 1,263
Pastry instances before the host system starts swapping
memory to disk. Load (averaged over the last minute)
remains reasonably low, which explains the small delays
presented by Figure 7(c).

Figure 8: Memory consumption and load evolution on a single
node hosting several instances of Pastry for SPLAY.


5.4 Complex Deployments 


SPLAY is designed to be used within a large set of differ- 
ent testbeds. Despite this diversity, it is sometimes also 
desirable to experiment with more than a single testbed at 
a time. For instance, one may want to evaluate a complex 
system with a set of peers linked by high bandwidth, non- 
lossy links, emulated by ModelNet, and a set of peers fac- 
ing adverse network conditions on PlanetLab. A typical 
usage would be to test a broker-based publish-subscribe 
infrastructure deployed on reliable nodes, along with a 
set of client nodes facing churn and lossy network links. 

Such a mixed deployment requires a deep understand- 
ing of the system for setting it up using scripting and 
common tools, as the user has to care about NAT and
firewall traversal, port forwarding, etc. The experiment
presented in this section shows that such a complex mixed 
deployment can be achieved using SPLAY as if it were on 
a single testbed. The only precondition is that the admin- 
istrator of the part of the testbed that is behind a NAT 
or firewall defines (and opens) a range of ports that all 
splayds will use to communicate with other daemons 
outside the testbed. Notably, for a ModelNet cluster, this
operation can easily be done at the time ModelNet is installed
on the nodes of the testbed, and it does not require
additional access rights. All other communication details
are dealt with by SPLAY itself: no modification is needed 
to the application code. 

Figure 9 presents the delay distribution for a deployment
of 1,000 nodes on PlanetLab, on ModelNet, and in
a mixed deployment over both testbeds at the same time
(i.e., 500 nodes on each). We notice that the delays of the
mixed deployment are distributed between the delays of
PlanetLab and the higher delays of our ModelNet cluster.
The “steps” on the ModelNet cumulative delays representation
are a result of routes of increasing number of
hops (both in Pastry and in the emulated topology), and
the fixed delays for ModelNet links.

Figure 9: Pastry on PlanetLab, ModelNet, and both.

5.5 Using Churn Management 

This section evaluates the use of the churn management 
module, both using traces and synthetic descriptions. Us- 
ing churn is as simple as launching a regular SPLAY ap- 
plication with a trace file as extra argument. SPLAY pro- 
vides a set of tools to generate and process trace files. 
One can, for instance, speed up a trace, increase the churn
amplitude whilst keeping its statistical properties, or generate
a trace from a synthetic description.

Figure 10: Using churn management to reproduce massive
churn conditions for the SPLAY Pastry implementation.


Figure 10 presents a typical experiment of a massive 
failure using the synthetic description. We ran Pastry on 
our local cluster with 1,500 nodes and, after 5 minutes, 
triggered a sudden failure of half of the network (750 
nodes). This models, for example, the disconnection of an
intercontinental link or a WAN link between two corporate
LANs. We observe that the number of failed lookups
reaches almost 50% after the massive failure due to rout- 
ing table entries referring to unreachable nodes. Pastry 
recovers all its routing capabilities in about 5 minutes 
and we can observe that delays actually decrease after 
the failure because the population has shrunk (delays are 
shown for successful routes only). While this scenario 
is amongst the simplest ones, churn descriptions allow 
users to experiment with much more complex scenarios, 
as discussed in Section 3.2. 

Our second experiment is representative of a complex 
test scenario that would usually involve much engineer- 
ing, testing and post-processing. We use the churn trace 
observed in the Overnet file sharing peer-to-peer system [12].
We want to observe the behavior of Pastry,
deployed on PlanetLab, when facing churn rates that are
well beyond the natural churn rates suffered in PlanetLab.
As we want increasing levels of churn, we simply
“speed up” the trace, that is, with a speed-up factor of 2×,
5× or 10×, a minute in the original trace is mapped to 30,
12 or 6 seconds respectively. Figure 11 presents both the
churn description and the evolution of delays and failure 
rates, for increasing levels of churn. The churn descrip- 
tion shows the population of nodes and the number of 
joins/leaves as a function of time, and performance ob- 
servations plot the evolution of the delay distribution as 
a function of time. We observe that (1) Pastry handles 
churn pretty well as we do not observe a significant fail- 
ure rate when as much as 14% of the nodes are changing 
state within a single minute; (2) running this experiment 
is neither more complex nor longer than on a single clus- 
ter without churn, as we did for Figure 7(a). Based on 
our own experience, we estimate that it takes at least one
order of magnitude less human effort to conduct this experiment
using SPLAY than with any other deployment
tool. We strongly believe that the availability of tools
such as SPLAY will encourage the community to further 
test and deploy their protocols under adverse conditions, 
and to compare systems using published churn models. 
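
The trace speed-up used in this experiment amounts to dividing every timestamp by the speed-up factor. The Python sketch below shows the idea on a toy trace; the tuple layout `(time_in_seconds, action, node)` and both function names are assumptions for illustration and do not reflect SPLAY's actual trace format or tools.

```python
def speed_up(trace, factor):
    """Compress a churn trace in time: events keep their order and kind,
    but every timestamp is divided by `factor`."""
    if factor <= 0:
        raise ValueError("factor must be positive")
    return [(t / factor, action, node) for (t, action, node) in trace]

def churn_per_window(trace, window=60.0):
    """Count join/leave events per time window of `window` seconds."""
    counts = {}
    for t, _action, _node in trace:
        slot = int(t // window)
        counts[slot] = counts.get(slot, 0) + 1
    return counts

# Toy trace: one event per minute (made-up data).
trace = [(0.0, "join", "a"), (60.0, "join", "b"), (120.0, "leave", "a")]
for factor in (2, 5, 10):
    print(factor, [t for t, _, _ in speed_up(trace, factor)])
```

With factors of 2, 5 and 10, the event originally at 60 seconds moves to 30, 12 and 6 seconds, so the same events fall into fewer one-minute windows and the per-minute churn rate rises accordingly.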


5.6 Deployment Performance 


This section presents an evaluation of the deployment 
time of an application on an adversarial testbed, PlanetLab.
This further supports our position from Section 3.1
that one needs to initially select a larger set of nodes than
requested to ensure that one can rely on reasonably responsive
nodes for deploying the application. Traditionally,
such a selection process is done by hand, or using
simple heuristics based on the load or response time of
the nodes. SPLAY relieves the user of the need to perform
this selection. Figure 12 presents the deployment
time for the Pastry application on PlanetLab. We
vary the number of additionally probed daemons from 
10% to 100% of the requested nodes. We observe that 
a larger set results in lower delays for deploying an appli- 
cation (hence, presumably, lower delays for subsequent 
application communications). Nonetheless, the selection 
of a reasonably large superset for a proper selection of 
peers is a tradeoff between deployment delay and redun- 
dant messages sent over the network. Based on experi- 
ments, we use by default an initial superset of 125% of 
requested nodes. 
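
The superset selection described above can be sketched in Python as follows: probe more daemons than requested, discard those that do not answer, and keep the most responsive ones. The function name, the probe results passed as a dictionary, and the error handling are assumptions for illustration; only the default ratio of 125% comes from the text.

```python
import math

def select_daemons(response_times, requested, superset_ratio=1.25):
    """Probe a superset of daemons and keep the most responsive ones.

    response_times: dict daemon -> probe round-trip time in seconds,
                    or None if the daemon did not answer.
    """
    superset_size = math.ceil(requested * superset_ratio)
    probed = list(response_times.items())[:superset_size]
    alive = [(rtt, d) for d, rtt in probed if rtt is not None]
    alive.sort()  # fastest responders first
    if len(alive) < requested:
        raise RuntimeError("not enough responsive daemons in the superset")
    return [d for _rtt, d in alive[:requested]]

# Made-up probe results: d2 never answered, d4 is very slow.
probes = {"d1": 0.20, "d2": None, "d3": 0.05, "d4": 1.30, "d5": 0.08}
print(select_daemons(probes, requested=3))                      # 125% superset
print(select_daemons(probes, requested=3, superset_ratio=2.0))  # larger superset
```

In this toy example the larger superset lets the slow daemon d4 be replaced by the faster d5, at the price of one extra probe, which mirrors the tradeoff between deployment delay and redundant messages.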


5.7 Resource-intensive Experiments 


Our last two experimental demonstrations deal with
resource-intensive applications, both for short-term and
long-term runs. They further convey SPLAY’s ability to
run in high-performance settings and production environments,
and demonstrate that the obtained performance
is similar to that achieved with a dedicated
implementation (particularly from the network point of
view). We run the following two experiments: (1) the
evaluation of a cooperative data distribution algorithm
based on parallel trees using both SPLAY and a native (C)
implementation on ModelNet and (2) a distributed cooperative
Web cache for HTTP accesses, which has been
running for several weeks under a constant and significant
load.

Figure 12: Deployment times of Pastry for SPLAY, as a function
of (1) the number of nodes requested and (2) the size of the
superset of daemons used.

Dissemination using trees. This experiment compares 
two versions of a simple cooperative protocol [13] based 
on parallel n-ary trees written with SPLAY and in C. We 
create n = 2 distinct trees in the same manner as Split- 
Stream [14] does: each of the 63 nodes is an inner mem- 
ber in one tree and a leaf in the other. The data to be trans- 
mitted is split into blocks, which are propagated along 
one of the 2 trees according to a round-robin policy. This 
experiment allows us to observe how SPLAY compares 
against a native application, CRCP, written in C [6]. Us- 
ing a tree for this comparison bears the advantage of high- 
lighting the additional delays and overheads of the plat- 
form and its network libraries (such as the sandboxing 
of network operations). These overheads accumulate at 
each level of the tree, from the root to the leaves. 
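
The tree construction and the round-robin block assignment can be sketched in Python as below. This is only an illustration under assumptions: the heap-style layout, the trick of reversing the node order for the second tree, and all function names are hypothetical, and the actual protocol [13] may build its trees differently. With 63 nodes and complete binary trees, this layout leaves one node as a leaf in both trees.

```python
def build_parallel_trees(nodes):
    """Two binary trees over the same nodes, laid out heap-style.

    Tree 0 uses the given order and tree 1 the reversed order, so the
    inner nodes of one tree are leaves of the other.
    """
    trees = []
    for order in (list(nodes), list(reversed(nodes))):
        children = {}
        for i, node in enumerate(order):
            # Heap layout: children of position i sit at 2i+1 and 2i+2.
            children[node] = order[2 * i + 1: 2 * i + 3]
        trees.append({"root": order[0], "children": children})
    return trees

def tree_for_block(block_index, n_trees=2):
    """Round-robin assignment of blocks to trees."""
    return block_index % n_trees

def inner_nodes(tree):
    """Nodes that forward data, i.e., that have at least one child."""
    return {n for n, kids in tree["children"].items() if kids}

trees = build_parallel_trees(range(63))
print(len(inner_nodes(trees[0])), len(inner_nodes(trees[1])))
print(inner_nodes(trees[0]).isdisjoint(inner_nodes(trees[1])))
```

Because the two sets of inner nodes are disjoint, no node has to forward blocks in both trees, which spreads the upload load across the peers.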

Tests were run in a ModelNet testbed configured with 
a symmetric bandwidth of 1 Mbps for each node. Results 
are shown in Figure 13 for binary trees, a 24 MB file, and 
different block sizes (16 KB, 128 KB, 512 KB). We ob- 
serve that both implementations produce similar results, 
which tends to demonstrate that the overhead of SPLAY’s 
language and libraries is negligible. Differences in shape 
are due to CRCP nodes sending chunks sequentially to 
their children, while SPLAY nodes send chunks in parallel.
In our settings (i.e., homogeneous bandwidth), this
should not change the completion time of the last peer as
links are saturated at all times.

Figure 13: File distribution using trees.
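
A rough lower bound on the completion time follows from the link capacity alone, since every peer must receive the whole file over its own 1 Mbps link. The Python sketch below computes it, assuming decimal units (1 MB = 10^6 bytes, 1 Mbps = 10^6 bits per second) and ignoring protocol overhead and the pipelining delay down the tree; the function name is hypothetical.

```python
def min_transfer_time(file_bytes, link_bps):
    """Seconds needed to push file_bytes through a link of link_bps bits/s."""
    return file_bytes * 8 / link_bps

FILE_BYTES = 24 * 10**6  # 24 MB file, decimal units assumed
LINK_BPS = 1 * 10**6     # 1 Mbps per node

print(min_transfer_time(FILE_BYTES, LINK_BPS))  # seconds for the whole file
for block_kb in (16, 128, 512):
    # Time to forward a single block over one link, with 1 KB = 1024 bytes.
    print(block_kb, min_transfer_time(block_kb * 1024, LINK_BPS))
```

Under these assumptions no peer can finish in less than 192 seconds, and larger blocks add more store-and-forward delay per tree level before a child can start relaying them.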


Long-term experiment: cooperative Web cache. Our 
last experiment presents the performance over time of a 
cooperative Web cache built using SPLAY following the 
same base design as Squirrel [21]. This experiment high- 
lights the ability of SPLAY to support long-run applica- 
tions under constant load. The cache uses our Pastry DHT 
implementation deployed in a cluster, with 100 nodes 
that proxy requests and store remote Web resources for 
speeding up subsequent accesses. For this experiment, 
we limit the number of entries stored by each node to
100. Cached resources are evicted according to an LRU 
policy or when they are older than 120 seconds. The co- 
operative Web cache has been run for three weeks. Fig- 
ure 14 presents the evolution of HTTP requests delay dis- 
tribution for a period of 100 hours along with the cache 
hit ratio. We injected a continuous stream of 100 requests 
per second extracted from real Web access traces [7] cor- 
responding to 1.7 million hits to 42,000 different URLs. 
We observe a steady cache hit ratio of 77.6%. The distribution
of experienced delays has remained stable throughout
the whole run of the application. Most accesses (75th percentile)
are cached and served in less than 25 to 100 ms,
compared to non-cached accesses that require 1 to 2 seconds
on average.
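
The eviction policy used by each cache node (at most 100 entries, LRU replacement, and a maximum age of 120 seconds) can be sketched with Python's `collections.OrderedDict`. The class and method names are hypothetical, and the current time is passed explicitly to keep the example deterministic; a real deployment would read a clock instead.

```python
from collections import OrderedDict

class LruTtlCache:
    """Bounded cache with LRU eviction and a maximum entry age."""

    def __init__(self, capacity=100, max_age=120.0):
        self.capacity = capacity      # entries per node, as in the experiment
        self.max_age = max_age        # seconds before an entry expires
        self.entries = OrderedDict()  # url -> (stored_at, body)

    def get(self, url, now):
        item = self.entries.get(url)
        if item is None:
            return None
        stored_at, body = item
        if now - stored_at > self.max_age:
            del self.entries[url]      # too old: treat as a miss
            return None
        self.entries.move_to_end(url)  # mark as most recently used
        return body

    def put(self, url, body, now):
        self.entries[url] = (now, body)
        self.entries.move_to_end(url)
        while len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict least recently used

cache = LruTtlCache(capacity=2)
cache.put("/a", "A", now=0.0)
cache.put("/b", "B", now=1.0)
cache.get("/a", now=2.0)       # touching /a makes /b the eviction candidate
cache.put("/c", "C", now=3.0)  # over capacity: /b is evicted
print(sorted(cache.entries))
```

The short maximum age bounds staleness, while the LRU order decides which entry to drop when a node reaches its capacity.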


6 Conclusion 


SPLAY is an infrastructure that aims at simplifying the 
development, deployment and evaluation of large-scale 
distributed applications. It incorporates several novel features
not found in existing tools and testbeds. SPLAY
applications are specified in a high-level, efficient
scripting language very close to the pseudo-code commonly
used by researchers in their publications. They execute
in a sandboxed environment and can thus be readily deployed
on non-dedicated hosts. SPLAY also includes a
comprehensive set of shared libraries tailored for the development
of distributed protocols. Application specifications
are based on an event-driven model and are extremely
concise.

Figure 14: Cooperative Web cache: evolution of delays and
cache hit ratios during a 4-day period.

SPLAY can seamlessly deploy applications in real (e.g., 
PlanetLab) or emulated (e.g., ModelNet) networks, as 
well as mixed environments. An original feature of 
SPLAY is its ability to inject churn in the system using 
a trace or a synthetic description to test applications in 
the most realistic conditions. Our thorough evaluation of 
SPLAY demonstrates that it allows developers to easily 
express complex systems in a concise yet readable man- 
ner, scales remarkably well thanks to its low footprint, 
exhibits very good performance in various deployment 
scenarios, and compares favorably against native appli- 
cations in our experiments. SPLAY is publicly available 
from http://www.splay-project.org.

¹Note that, without the requirement for binary compatibility, the
size of our implementation could be significantly reduced. Our BitTorrent
implementation has been successfully used to download the
Ubuntu Linux disk several times in official swarms.

²This possibility is notably useful to test characteristics that do not
depend much on the performance of individual nodes with a limited-size
testbed, e.g., to evaluate the scalability of routing in an overlay.

Abstract


Network emulation subjects real applications and pro- 
tocols to controlled network conditions. Most existing 
network emulators are fundamentally link emulators, not
path emulators: they concentrate on faithful emulation 
of the transmission and queuing behavior of individual 
network hops in isolation, rather than a path as a whole. 
This presents an obstacle to constructing emulations of 
observed Internet paths, for which detailed parameters 
are difficult or impossible to obtain on a hop-by-hop ba- 
sis. For many experiments, however, the experimenter’s 
primary concern is the end-to-end behavior of paths, not 
the details of queues in the interior of the network. 

End-to-end measurements of many networks, includ- 
ing the Internet, are readily available and potentially pro- 
vide a good data source from which to construct realistic 
emulations. Directly using such measurements to drive 
a link emulator, however, exposes a fundamental dis- 
connect: link emulators model the capacity of resources 
such as link bandwidth and router queues, but when re- 
producing Internet paths, we generally wish to emulate 
the measured availability of these resources. 

In this paper, we identify a set of four principles for 
emulating entire paths. We use these principles to de- 
sign and implement a path emulator. All parameters to 
our model can be measured or derived from end-to-end 
observations of the Internet. We demonstrate our emu- 
lator’s ability to accurately recreate conditions observed 
on Internet paths. 


1 Introduction 


In network emulation, a real application or protocol, run- 
ning on real devices, is subjected to artificially induced 
network conditions. This gives experimenters the oppor- 
tunity to develop, debug, and evaluate networked sys- 
tems in an environment that is more representative of the 
Internet than a LAN, yet more controlled and predictable 
than running live across deployed networks such as the 
Internet. Due to these properties, network emulation has 
become a popular tool in the networking and distributed 
systems communities. 


Network emulators work by forwarding packets from 
an application under test through a set of queues that ap- 
proximate the behavior of router queues. By adjusting 
the parameters of these queues, an experimenter can con- 
trol the emulated capacity of a link, delay packets, and in- 
troduce packet loss. Popular network emulators include 
Dummynet [22], ModelNet [27], NIST Net [7], and Em- 
ulab (which uses Dummynet) [32]. These emulators fo- 
cus on link emulation, meaning that they concentrate on 
faithful emulation of individual links and queues. 

In many cases, particularly in distributed systems, the 
system under test runs on hosts at the edges of the net- 
work. Experiments on these systems are concerned with 
the end-to-end characteristics of the paths between hosts, 
not with the behavior of individual queues in the net- 
work. For such experiments, detailed modeling of in- 
dividual queues is not a necessity, so long as end-to-end 
properties are preserved. One way to create emulations 
with realistic conditions is to use parameters from real 
networks, such as the Internet, but it can be difficult or 
impossible to obtain the necessary level of detail to recre- 
ate real networks on a hop-by-hop basis. Thus, in order 
to run experiments using conditions from real networks, 
there is a clear need for a new type of emulator that mod- 
els paths as a whole rather than individual queues. 

In this paper, we identify a set of principles for path 
emulation and present the design and implementation of 
a new path emulator. This emulator uses an abstract and 
straightforward model of path behavior. Rather than re- 
quiring parameters for each hop in the path, it uses a 
much smaller set of parameters to describe the entire 
path. The parameters for our model can be estimated or 
derived from end-to-end measurements of Internet paths. 
In addition to the simplicity and efficiency benefits, this 
end-to-end focus makes our emulator suitable for recre- 
ating observed Internet paths inside a network testbed, 
such as Emulab, where experiments are predictable, re- 
peatable, and controlled. 


1.1 Path Emulation Approaches 


One approach to emulating paths is to use multiple in-
stances of a link emulator, creating a series of queues for
the traffic under test to pass through, much like the series
of routers it would pass through on a real path. Model- 
Net and Emulab in particular are designed for use in this 
fashion. Building a path emulator in this way, however, 
requires a router-level topology. While such topologies 
can be generated from models or obtained for particu- 
lar networks, obtaining detailed topologies for arbitrary 
Internet paths is very difficult. Worse, to construct an ac- 
curate emulation, capacity, queue size, and background 
traffic for each link in the path must be known, making 
reconstruction of Internet paths intractable. 

Another approach is to approximate a path as a single
link, using desired end-to-end characteristics, such as
available bandwidth and observed round-trip time, to set
the parameters of a single link emulator. Because these
properties can be measured from the edges of the net- 
work, this is an attractive approach. A recent survey of 
the distributed systems literature [29] shows that many 
distributed or network systems papers published in top 
venues [4, 5, 8, 18, 19, 23, 26, 30]—nearly one third of 
those surveyed—include a topology in which a single 
hop is used to approximate a path. 

On the surface, this seems like a reasonable approxi- 
mation: distributed systems tend to be sensitive to high- 
level network characteristics such as bandwidth, latency, 
and packet loss rather than the fine-grained queuing
behavior of every router along a path. However, as we
discuss in Section 2 and demonstrate in Section 4, using a
single link emulator to model a measured path can often
fail even simple tests of accuracy. This is due to a
fundamental mismatch: link emulators model the capacity
of links, whereas end-to-end measurements reveal the
availability of resources on those links. This difference
can result in flows being
unable to achieve the bandwidth set by the experimenter 
or seeing unrealistic round-trip times, and these errors 
can be quite large. This model also does not capture in- 
teractions between paths, such as shared bottlenecks, or 
within paths, such as the reactivity of background flows. 


1.2 Path Emulation Principles


What is needed is a new approach to emulation that 
models entire Internet paths rather than individual links 
within those paths. We have identified four principles for 
designing such an emulator: 


• Model capacity and available bandwidth separately.
Existing link emulators model links with
limited capacity. We show why this is not always
sufficient to create a path emulation with a particular
target available bandwidth. We provide the
mathematical basis for deciding how much capacity
and how much cross-traffic are necessary to produce
the desired effect.

• Pick appropriate queue sizes. Much work has
been done in choosing “good” values for queue
sizes in real routers, but the issues that apply to
emulation are somewhat different. We define concrete
upper and lower bounds for queue sizes in emulation
and simulation. These bounds are derived from
the delay and available bandwidth parameters of the
emulated paths to ensure that the configured bandwidth
is actually achievable.

• Use an abstracted model of the reactivity of
background flows. Real networks have cross-traffic
that reacts in complex ways to foreground
traffic. Available bandwidth can change in reaction
to foreground flows, and thus is a function of the
load offered by the system under test. Discovering
the characteristics of background traffic from the
edge of the network is very difficult; even the degree
of statistical multiplexing is obscured by TCP
unfairness in the presence of disparate RTTs [15].
We show that we can model reactivity by concentrating
only on the effect that the reactivity of the
background flows has on foreground flows.

• Model shared bottlenecks. When modeling a set
of paths, it is likely that some of those paths share
bottlenecks, and that this will affect the properties
seen by foreground flows. Such bottleneck sharing
occurs naturally in router-level emulation, but must
be explicitly modeled in an abstracted emulation.


Note that any of these principles can, individually, be 
applied to a link emulator; indeed, our path emulator 
implementation, presented in Section 3, is based on the 
Dummynet link emulator. Our contribution lies in iden- 
tifying all four principles as being fundamental to path 
emulation, and in implementing a path emulator based on 
them so that they can be empirically evaluated. Although 
our focus in this paper is on emulation, these principles 
are also applicable to simulation. 


2 Path Modeling 


Our path model grows out of these four principles. It 
takes as input a set of five parameters: base round- 
trip time (RTT), available bandwidth (ABW), capacity, 
shared bottlenecks, and functions describing the reactiv- 
ity of background traffic. As shown in Section 3.3, it is 
possible to measure each of these parameters from end 
hosts on the Internet, making it feasible to build recon- 
structions of real paths. We discuss the ways in which 
these parameters are interrelated, and contrast our model 
with the approach of using end-to-end measurements as 
input to a single link emulator, showing the deficiencies 
of such an approach and how our model corrects them.

Our model focuses on accommodating foreground
TCP flows, leaving emulation for other types of fore- 
ground flows as future work. We also concentrate on em- 
ulating stationary conditions for paths; in principle, any 
or all parameters to our model can be made time-varying 
to capture more dynamic network behavior. 


2.1 Base RTT 


The round-trip time (RTT) of a path is the time it takes 
for a packet to be transferred in one direction plus the 
time for an acknowledgment to be transferred in the opposite
direction. We model the RTT of a path by breaking it into
two components: the “base RTT” [6] (RTT_base) and the
queuing delay of the bottleneck link.

The base RTT includes the propagation, transmission, 
and processing delay for the entire path and the queuing 
delay of all non-bottleneck links. When the queue on 
the bottleneck link is empty, the RTT of the path is sim- 
ply the base RTT. In practice, the minimum RTT seen 
on a path is a good approximation of its base RTT. Be- 
cause transmission and propagation delays are constant, 
and processing delays for an individual flow tend to be 
stable, a period of low RTT indicates a period of little or 
no queuing delay. 

The base RTT represents the portion of delay that is 
relatively insensitive to network load offered by the fore- 
ground flows. This means that we do not need to emulate 
these network delays on a detailed hop-by-hop basis: a 
fixed delay for each path is sufficient. 
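
The minimum-RTT approximation described above can be sketched in a few lines. This is only an illustration, not the measurement procedure used in the paper; the function names and the sample values are hypothetical.

```python
def estimate_base_rtt(rtt_samples_ms):
    """Approximate the base RTT as the smallest RTT observed on the path."""
    if not rtt_samples_ms:
        raise ValueError("need at least one RTT sample")
    return min(rtt_samples_ms)


def queuing_delays(rtt_samples_ms):
    """Attribute everything above the base RTT to bottleneck queuing."""
    base = estimate_base_rtt(rtt_samples_ms)
    return [sample - base for sample in rtt_samples_ms]


samples = [52.0, 50.0, 61.5, 74.0, 50.5]  # hypothetical measurements, in ms
print(estimate_base_rtt(samples))
print(queuing_delays(samples))
```

The estimate is only as good as the sampling period: if the bottleneck queue never empties while samples are taken, the minimum overstates the base RTT.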


2.2 Capacity, Available Bandwidth, and 
Queuing 


The bottleneck link controls the bandwidth available on 
the path, contributes queuing delay to the RTT, and 
causes packet loss when its queue fills. Thus, three prop- 
erties of this link are closely intertwined: link capacity, 
available bandwidth, and queue size. 

We make the common assumption that there is only 
one bottleneck link on a path in a given direction [9] at a 
given time, though we do not assume that the same link 
is the bottleneck in both directions. 


2.2.1 Capacity and Available Bandwidth 


Existing link emulators fundamentally emulate limited 
capacity on links. The link speed given to the emula- 
tor is used to determine the rate at which packets drain 
from the emulator’s bandwidth queue, in the same way 
that a router’s queue empties at a rate governed by the 
capacity of the outgoing link. The quantity that more
directly affects distributed applications, however, is
available bandwidth, which we consider to be the maximum
rate sustainable by a foreground TCP flow. This is the
rate at which the foreground flow’s packets empty from
the bottleneck queue. Assuming the existence of competing
traffic, this rate is lower than the link’s capacity.

It is not enough to emulate available bandwidth us- 
ing a capacity mechanism. Suppose that we set the ca- 
pacity of a link emulator using the available bandwidth 
measured on some Internet path: inside of the emulator, 
packets will drain more slowly than they do in the real 
world. This difference in rate can result in vastly dif- 
ferent queuing delays, which is not only disastrous for 
latency-sensitive experiments, but as we will show, can 
cause inaccurate bandwidth in the emulator as well. 

Let q_f and q_r be the sizes of the bottleneck queues in
the forward and reverse directions, respectively, and let
C_f and C_r be the capacities. The maximum time a packet
may spend in a queue is q/C, giving us a maximum RTT
that can be observed on the path:

\[ RTT_{max} = RTT_{base} + \frac{q_f}{C_f} + \frac{q_r}{C_r} \tag{1} \]

If we were to use ABW_f and ABW_r, the available
bandwidth measured from some real Internet path, to
set C_f and C_r, Equation 1 would yield much larger
queuing delays within the emulator than seen on the real path
(assuming the queue sizes on the path and in the emulator
are the same).

For instance, consider a real path with RTT_base =
50 ms, a bottleneck of symmetric capacity C_f = C_r =
43 Mbps (a T-3 link), and available bandwidth ABW_f =
ABW_r = 4.3 Mbps. For a small q_f and q_r of 64 KB
(fillable by a single TCP flow), the RTT on the path is
bounded at 74 ms, since the forward and reverse directions
each contribute at most 12 ms of queuing delay. However,
if we set C_f = C_r = 4.3 Mbps within an emulator (keeping
queue sizes the same), each direction of the path can
contribute up to 120 ms of queuing delay. The total
resulting RTT could reach as high as 290 ms.
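
The arithmetic in this example can be checked with a short sketch of Equation 1. The function name is hypothetical, and 1 KB is assumed to mean 1024 bytes, which the text does not specify.

```python
def max_rtt_ms(base_rtt_ms, q_f_bytes, q_r_bytes, c_f_bps, c_r_bps):
    """Equation 1: base RTT plus the time to drain a full queue each way."""
    forward_ms = q_f_bytes * 8 / c_f_bps * 1000
    reverse_ms = q_r_bytes * 8 / c_r_bps * 1000
    return base_rtt_ms + forward_ms + reverse_ms


QUEUE = 64 * 1024  # 64 KB, assuming 1 KB = 1024 bytes

# Real path: the queues drain at the T-3 capacity.
real_path = max_rtt_ms(50, QUEUE, QUEUE, 43e6, 43e6)
# Link emulator whose capacity was set to the measured ABW.
emulated = max_rtt_ms(50, QUEUE, QUEUE, 4.3e6, 4.3e6)

print(f"real path: {real_path:.1f} ms")
print(f"capacity set to ABW: {emulated:.1f} ms")
```

Under these assumptions the first value comes out near 74 ms and the second near 294 ms; the 290 ms quoted in the text follows from rounding each direction's delay to 120 ms.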

This unrealistically high RTT can lead to two prob- 
lems. First, it fails to accurately emulate the RTT of the 
real path, causing problems for latency-sensitive appli- 
cations. Second, it can also affect the bandwidth avail- 
able to TCP, a problem we discuss in more detail in 
Section 2.2.2. 

One approach to reducing the maximum queuing delay
would be to simply reduce q_f and q_r inside the
emulator. This may result in queues that are simply too
small. In the example above, to reduce the queuing delay
within the path emulator to the same level as the Internet
path, we would have to reduce the queue size by a factor
of 10, to 6.4 KB. A queue this small will cause packet loss
if a stream sends a small burst of traffic, preventing TCP
from achieving the requested available bandwidth. We also
discuss minimum queue size in more detail in Section 2.2.2.

The solution to these queuing problems is to separate
the notions of capacity and available bandwidth in our 
path emulation model: they are independent parameters 
to each path. When we wish to emulate a path with com- 
peting traffic at the bottleneck, we set C > ABW. To 
model links with no background traffic, we can still set 
C = ABW, as is done implicitly in a link emulator. 

Of course, when C > ABW, we must fill the excess 
capacity to limit foreground flows to the desired ABW. 
A common solution to this problem has been to add 
a number of background TCP flows to the bottleneck. 
The problem with this technique is one of measurement. 
When the emulation is constructed using end-to-end ob- 
servations of a real path, discovering the precise behav- 
ior or even the number of competing background flows is 
not possible from the edges of the network. Adding reac- 
tive background flows to our emulation would not mirror 
the reactivity on the real network, and would result in an 
inexact ABW in the emulator. 

Since there is not enough information to replicate the 
background traffic at the bottleneck, we separately em- 
ulate its rate and its reactivity. We can precisely emu- 
late a particular level of background traffic using non- 
responsive, constant-bit-rate traffic. This mechanism al- 
lows us to provide an independent mechanism for emu- 
lating reactivity, described in Section 2.3. The reactivity 
model can change the level of background traffic to em- 
ulate responsiveness while providing a precise available 
bandwidth to the application at every point in time. 


2.2.2 Queue Size 


Much work has been done in choosing appropriate val- 
ues for queue sizes in real routers [1], but the set of con- 
straints for emulation are somewhat different: we have a 
relatively small set of foreground flows and a specific tar- 
get ABW that we wish to achieve. Although queue sizes 
can be provided directly as parameters to our model, we 
typically calculate them from other parameters. We do 
this for two reasons. First, it is difficult to measure the 
bottleneck queue size from the endpoints of the network 
due to interference from cross-traffic. Second, the bot- 
tleneck queue size affects applications only through ad- 
ditional latency or reduced bandwidth it might cause. 
Because our primary concern is emulating application- 
visible effects, we include a method for selecting a queue 
size that enables accurate emulation of those effects. 

We look at queue sizes in two ways: in terms of space 
(their capacity in bytes or packets) and in terms of time 
(the maximum queuing delay they may induce). This 
leads to two constraints on queue size: 


• The queue must be large enough in space that a TCP
stream is able to get the full desired ABW; it should
not drop bursts of packets.

• The queue must not be so large in time that the
queuing delay from a full queue causes excessive
RTTs, as seen in Equation 1.


Lower bound. Finding the lower bound is straightforward.
Current best practices suggest that for a small
number of flows, a good lower bound on queue size is
the sum of the bandwidth-delay products of all flows
traversing that link [1]. Here, a “small number” of flows
is fewer than about 500. Because we are concerned
only with flows of a foreground application, the number
of flows on a specific path will typically be much
smaller than this. For a TCP flow f, the window size
w_f is roughly equal to its bandwidth-delay product, and
is capped by w_max, the maximum window size allowed
by the TCP implementation. Thus, for a given path in a
given direction, we sum over the set of flows F, giving
us a lower bound on q:

\[ q \ge \sum_{f \in F} \min(w_f, w_{max}) \tag{2} \]


This bound applies to the queues in both directions on
the path, q_f and q_r. Intuitively, the queue must be large
enough to hold at least one window’s worth of packets
for each flow traversing the queue.
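
A minimal sketch of Equation 2 follows. It assumes each flow is described by the rate it should reach and its RTT, so that w_f is their product; the function name and the example flows are hypothetical.

```python
def queue_lower_bound(flows, w_max_bytes):
    """Equation 2: sum of min(w_f, w_max) over the flows crossing the queue.

    flows: iterable of (rate_bps, rtt_s) pairs, one per foreground flow.
    """
    total = 0.0
    for rate_bps, rtt_s in flows:
        w_f = rate_bps * rtt_s / 8  # bandwidth-delay product, in bytes
        total += min(w_f, w_max_bytes)
    return total


# One 10 Mbps flow with a 20 ms RTT, and one fast flow capped by w_max.
flows = [(10e6, 0.020), (100e6, 0.050)]
print(queue_lower_bound(flows, w_max_bytes=65536))
```

The second flow illustrates the cap: its bandwidth-delay product is far larger than w_max, so it contributes only w_max to the bound.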

Upper bound. The upper bound is more complex.
The maximum RTT tolerable for a given flow on a given
path, before it becomes window-limited, is given by
(using the empirically derived TCP performance model
demonstrated by Padhye et al. [16]):

\[ RTT_{max} = \frac{w_{max}}{ABW} \tag{3} \]

where ABW is the available bandwidth we wish the flow
to experience. If the RTT grows above this limit, the
bandwidth-delay product exceeds the maximum window
size w_max, and the flow’s bandwidth will be limited by
TCP itself, rather than the ABW we have set in the
emulator. Since our goal is to accurately emulate the given
ABW, this would result in an incorrect emulation.

It is important to note that a single flow along a path
cannot cause itself to become window-limited, as it will
either fill up the bottleneck queue before it reaches w_max,
or stabilize on an average queue occupancy no larger than
w_max. Two or more flows, however, can induce this
behavior in each other by filling a queue to a greater depth
than can be sustained by either one. Even flows crossing
a bottleneck in opposite directions can cause excessive
RTTs, as each flow’s ACK packets must wait in a queue
with the other flow’s data packets. The value of w_max
may be defined by several factors, including limitations
of the TCP header and configuration options in the TCP
stack, but is essentially known and fixed for a given
experiment.

Flows may travel in both directions along a path, and
while both will see the same RTT, they may have different
RTT_max values if the ABW on the path is not symmetric.
Without loss of generality, we define the “forward”
direction of the path to be the one with the higher ABW.
From Equation 3, flows in this direction have the smaller
RTT_max, and since we do not want either flow to become
window-limited, we use ABW_f to find the upper bound.

Because most (Reno-derived) TCP stacks tend to
reach a steady state in which the bottleneck queue is
full [16], bottleneck queues tend to be nearly full, on
average. Thus, we can expect flows to experience RTTs
near the maximum given by Equation 1 in steady-state
operation. For our emulation of ABW to be accurate,
then, Equation 1 (the maximum observable RTT) must
be less than or equal to Equation 3 (the maximum
tolerable RTT). If we set the two capacities to be equal and
solve for the queue sizes, this gives us:

\[ q_f + q_r \le C \cdot \left( \frac{w_{max}}{ABW_f} - RTT_{base} \right) \tag{4} \]

Because all terms on the right side are either fixed or pa- 
rameters of the path, we have a bound on the total queue 
size for the path. (It is not necessary for the forward and 
reverse capacities to be equal to solve the equation. We 
do so here for simplicity and clarity.) 
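
The step from Equations 1 and 3 to Equation 4 can be written out explicitly. The derivation below assumes C_f = C_r = C, as in the text, and uses ABW_f because the forward direction has the smaller tolerable RTT:

```latex
\begin{align*}
RTT_{base} + \frac{q_f}{C} + \frac{q_r}{C} &\le \frac{w_{max}}{ABW_f} \\
\frac{q_f + q_r}{C} &\le \frac{w_{max}}{ABW_f} - RTT_{base} \\
q_f + q_r &\le C \cdot \left( \frac{w_{max}}{ABW_f} - RTT_{base} \right)
\end{align*}
```

The first line requires the maximum observable RTT (Equation 1) to stay at or below the maximum tolerable RTT (Equation 3); the remaining lines isolate the total queue size.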

Setting the Queue Size. To select sizes for the queues 
on a path, we must simply split the total upper bound in 
Equation 4 between the two directions, in such a way that 
neither violates Equation 2. 

These two bounds have a very important property: it is 
not necessarily possible to satisfy both when C = ABW. 
When either bound is not met, the emulation will not pro- 
vide the desired network characteristics. The capacity C 
acts as a scaling factor on the upper bound. By adjust- 
ing it while holding ABW constant, we can raise or lower 
the maximum allowable queue size, making it possible 
to satisfy both equations. 

Figure 1 illustrates this principle by showing valid 
queue sizes as a function of capacity. As capacity
increases, the upper bound rises while the lower
bound remains constant. When capacity is at or near 
available bandwidth, the upper bound is below the lower 
bound, which means that no viable queue size can be se- 
lected. As capacity increases, these lines intersect and 
yield an expanding region of queue sizes that fulfill both 
constraints. This underscores the importance of emulat- 
ing available bandwidth and capacity separately. 
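
The interaction between the two bounds can be explored numerically. The sketch below uses the parameters given in the caption of Figure 1, but the lower bound is an assumption: it supposes one flow per direction, each needing min(ABW * RTT_base, w_max) bytes, which may differ from how the figure was produced. All function names are hypothetical, and 1 KB is taken as 1024 bytes.

```python
def upper_bound_bytes(capacity_bps, abw_f_bps, base_rtt_s, w_max_bytes):
    """Equation 4: total queue space (q_f + q_r) allowed, in bytes."""
    rtt_max_s = w_max_bytes * 8 / abw_f_bps  # Equation 3
    return capacity_bps * (rtt_max_s - base_rtt_s) / 8


def split_queues(capacity_bps, abw_f_bps, base_rtt_s, w_max_bytes,
                 lower_f_bytes, lower_r_bytes):
    """Return (q_f, q_r) satisfying both bounds, or None if none exists."""
    total = upper_bound_bytes(capacity_bps, abw_f_bps, base_rtt_s, w_max_bytes)
    if total < lower_f_bytes + lower_r_bytes:
        return None
    # Share the slack evenly; any split respecting both lower bounds works.
    slack = total - lower_f_bytes - lower_r_bytes
    return lower_f_bytes + slack / 2, lower_r_bytes + slack / 2


ABW = 10e6         # 10 Mbps in both directions
BASE_RTT = 0.020   # 20 ms
W_MAX = 65 * 1024  # 65 KB, assuming 1 KB = 1024 bytes
# Assumed lower bound: one flow per direction (see the note above).
LOWER = min(ABW * BASE_RTT / 8, W_MAX)

for capacity in (10e6, 12e6, 15e6, 20e6, 50e6):
    sizes = split_queues(capacity, ABW, BASE_RTT, W_MAX, LOWER, LOWER)
    print(f"{capacity / 1e6:.0f} Mbps: {sizes}")
```

With these assumptions no valid queue size exists when capacity equals the available bandwidth, and a feasible region opens up only once capacity is raised above it.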

Asymmetry. Throughput artifacts due to violations
of Equation 3 are exacerbated when traffic on the path
is bidirectional and the available bandwidth is asymmetric.
In this case, the flows in each direction can tolerate
different maximum RTTs, with the flow in the forward
(higher ABW) direction having the smaller upper bound.
This means that it is disproportionately affected by high
RTTs. Others have described this phenomenon [2], and
we demonstrate it empirically in Section 4.1.

Figure 1: The relationship between capacity and the bounds
on queue size for a path with ABW_f = ABW_r = 10 Mbps,
RTT_base = 20 ms, and w_max = 65 KB. Low capacities prevent
any viable queue size.


To determine how common paths with asymmetric 
ABW are in practice, we measured the available band- 
width on 7,939 paths between PlanetLab [17] nodes. Of 
those paths, 30% had greater ABW in one direction than 
the other by a ratio of at least 2:1, and 8% had a ratio of 
at least 10:1. Because links with asymmetric capacities 
(e.g., DSL and cable modems) are most common as last- 
mile links, and because PlanetLab has few nodes at such 
sites, it is highly likely that most of this asymmetry is 
a result of bottlenecks carrying asymmetric traffic. Our 
experiments in Section 4 show that on a path with an
available bandwidth asymmetry ratio as small as 1.5:1, 
a simple link emulation model that does not separate ca- 
pacity and ABW, and does not set queue sizes carefully, 
can result in a 30% error in achieved throughput. 


2.2.3 Putting It Together 


Figure 2 shows an overview of our model as described 
thus far. We model the bottleneck of a path with a queue 
that drains at a fixed rate, and a constant bit-rate cross- 
traffic source. The rate at which the queue drains is the 
capacity, and the difference between the injection rate of 
the cross-traffic and the capacity is the available
bandwidth. The remainder of the delay on the path is
modeled by delaying packets for a constant amount of time
governed by RTT_base. The two halves of the path are
modeled independently to allow for asymmetric paths.


NSDI ’09: 6th USENIX Symposium on Networked Systems Design and Implementation 




[Figure 2 diagram. Labels: Path Emulator; Bottleneck; ABW; Base RTT; CBR Traffic]


Figure 2: Modeling a single path, in both the forward and re- 
verse directions. 


2.3 Interactions Between Flows 


In addition to emulating the behavior of the foreground 
flows’ packets in the bottleneck queue, we must also em- 
ulate two important interactions: the interaction of mul- 
tiple foreground flows on different paths that share bot- 
tlenecks, and the interaction of foreground flows with re- 
sponsive background traffic. 

Shared Bottlenecks. To properly emulate sets of 
paths, we must take into account bottlenecks that are 
shared by more than one path. Consider the simple case 
in Figure 3. If we do not model the bottleneck BL2 
(shared by the paths from source S to destinations D2 
and D3), we will allow multiple paths to independently 
use bandwidth that should be shared between them. Do- 
ing so could result in the application getting significantly 
more bandwidth within the emulator than it would on the 
real paths [21]. 

We do not, however, need to know the full router-level 
topology of a set of paths in order to know that they share 
bottlenecks. Existing techniques [12] can detect the ex- 
istence of such bottlenecks from the edges of the net- 
work, by correlating the observed timings of simultane- 
ous packet transmissions on the paths. 

To model paths that share a bottleneck, we abstract 
shared bottlenecks in a simple manner: instead of giv- 
ing each path an independent bandwidth queue, we allow 
multiple paths to share the same queue. Traffic leaving a 
node is placed into the appropriate queue based on which 
destinations, if any, share bottlenecks from that source. 


This is illustrated in Figure 4: the two bottleneck links 
in the original topology are represented as bottleneck 
queues inside the path emulator. While paths sharing a 
bottleneck link share a bottleneck queue, each still has 
its own base RTT applied separately. Because base RTT 
represents links in the path other than the bottleneck link, 
links with a shared bottleneck do not necessarily have the 
same RTT. With this model, it is also possible for a path 
to pass through a different shared bottleneck in each di- 
rection. 
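
The queue-sharing abstraction can be sketched as a mapping from destinations to bottleneck queues. This is an illustrative data structure, not the Dummynet-based implementation; the class names are hypothetical, destination D1 is assumed to use the unshared bottleneck, and the base delays are made-up values.

```python
from dataclasses import dataclass, field


@dataclass
class BottleneckQueue:
    name: str
    packets: list = field(default_factory=list)


class SourceShaper:
    """Maps each destination to a bottleneck queue and a per-path base delay."""

    def __init__(self, equivalence_classes, base_delay_ms):
        # equivalence_classes: list of sets of destinations sharing a bottleneck.
        self.queue_for = {}
        self.queues = []
        for i, dests in enumerate(equivalence_classes):
            queue = BottleneckQueue(name=f"bottleneck-{i}")
            self.queues.append(queue)
            for dest in dests:
                if dest in self.queue_for:
                    raise ValueError(f"{dest} appears in two classes")
                self.queue_for[dest] = queue
        self.base_delay_ms = dict(base_delay_ms)

    def enqueue(self, dest, packet):
        """Place a packet in the queue shared by its destination's class."""
        queue = self.queue_for[dest]
        queue.packets.append((dest, packet))
        # The base delay stays per-path even when the queue is shared.
        return queue.name, self.base_delay_ms[dest]


# D2 and D3 share a bottleneck; D1 (assumed) has its own.
shaper = SourceShaper([{"D1"}, {"D2", "D3"}],
                      base_delay_ms={"D1": 10, "D2": 25, "D3": 40})
print(shaper.enqueue("D2", "pkt-a"))
print(shaper.enqueue("D3", "pkt-b"))
```

Packets to D2 and D3 land in the same queue object and therefore compete for the same drain rate, while each keeps its own base delay.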

Figure 3: A router-level topology, showing two bottleneck
links. One (BL2) is shared by two paths from source S: the
paths to destinations D2 and D3.

Figure 4: An abstracted view of Figure 3, with the bottleneck
links represented as bottleneck queues.

Reactivity of Background Traffic. Flows traversing
real Internet paths interact with cross-traffic, and this 
cross-traffic typically has some reactivity to the fore- 
ground flows. Thus, ABW on a path is not constant, even 
under the assumption that the set of background flows 
does not change. Simply setting a static ABW for a path 
can miss important effects: if more than one flow is sent 
along the path, the aggregate ABW available to all fore- 
ground flows may be greater, as the background traffic 
backs off further in reaction to the increased load. This 
is particularly important when the bottleneck is shared 
between two or more paths; the load on the bottleneck is 
the sum of the load on all paths that pass through it. 

While it is possible to create reactivity in the emula- 
tion by sending real, reactive cross-traffic (such as com- 
peting TCP flows) across the bottlenecks, doing so in a 
way that faithfully reproduces conditions on an observed 
link is problematic. The number, size, and RTT of these 
background flows all affect their reactivity, and such de- 
tail is not easily observed from endpoints. We turn to our 
guiding principle of abstraction, and model the reactivity 
of the background traffic to our foreground flows, rather 
than the details of the background traffic itself. 

We look at ABW as a function of offered load:
ABW_d(L_d) gives the aggregate bandwidth available in
direction d (forward or reverse) of a given path, as a
function of the offered load L_d in that direction on that
path. A set of such functions, one for each direction on
each path, is supplied as a parameter to the emulation.
Note that this offered load, and with it the available
bandwidth, will likely vary over time during the emulated
experiment. The ABW function can be created analytically
based on a model or it can be measured directly
from a real path, by offering loads at different levels and
observing the resulting throughput. The emulation can
then provide, with high accuracy, exactly the desired
ABW. Once we have used the reactivity functions to
determine the aggregate bandwidth available on a path, we
can set both the capacity and queue sizes as described in
Section 2.2.2.

Because an ABW function is a parameter of a particular
path, when multiple paths share a bottleneck, we must
combine their functions. There are multiple ways that the
ABW functions may be combined. Ideally, we would like
to account for every possible combination of flows using
every possible set of paths that share the bottleneck. The
combinatorial explosion this creates, however, quickly
makes this infeasible for even a modest number of paths.
Instead, the simple strategy that we currently employ is
to take the mean of the ABW values for each individual
path sharing the bottleneck, weighted by the number of
flows on each path. We are exploring the possibility that
more complicated approaches may yield more realistic
results.


3 Implementing a Path Emulator


Although the model we have discussed is applicable to
both simulation and emulation, we chose to do our initial
implementation in an emulator. Our prototype path
emulator is implemented as a set of enhancements to the
Dummynet [22] link emulator. We constructed our prototype
within the Emulab network testbed [32], but it is
not fundamentally linked to that platform.

Parameter               Source    Description
----------------------  --------  ------------------------------------------
C_f, C_r                fixed     Capacity of the bottleneck in the forward
                                  and reverse directions. Fixed to a value
                                  sufficient to make satisfaction of queue
                                  bounds possible for most experiments.
ABW_f(|F_f|),           measured  Table giving available bandwidth for the
ABW_r(|F_r|)                      forward and reverse directions, as a
                                  function of the number of flows traversing
                                  the path in that direction.
RTT_base                measured  Base RTT of the path, split evenly between
                                  the two directions.
Shared-bottleneck set             A subset of paths p from the set of all
                                  paths P that share a common bottleneck.
                                  Multiple instances of this parameter may
                                  be given.
q_f, q_r                derived   The queue size for each bottleneck is
                                  derived from the measured values of ABW,
                                  RTT_base, and the fixed capacity. If ABW
                                  is adjusted based on the reactivity table,
                                  queue size is as well.

Table 1: The parameters to our path emulation. All parameters
except the shared-bottleneck sets are given on a per-path basis.





3.1 Basis: The Dummynet Link Emulator 


Dummynet is a popular link emulator implemented in the 
FreeBSD kernel. It intercepts packets coming through 
an incoming network interface and places them in its in- 
ternal objects—called pipes—to emulate the effects of 
delay, limited bandwidth, and probabilistic random loss. 
Each pipe has one or more queues associated with it. 
Given the capacity or the delay of a pipe, Dummynet 
schedules packets to be emptied from the corresponding 
queues and places them on the outgoing interface. 

Dummynet can be configured to send a packet through 
multiple pipes on its path from an incoming interface to 
an outgoing interface. One pipe may enforce the base 
delay of the link, and a subsequent pipe may model the 
capacity of the link being emulated. Dummynet uses the 
IPFW packet filter to direct packets into pipes, and can 
therefore use many different criteria to map packets to 
pipes. 

In network emulation testbeds, “shaping nodes” are in- 
terposed on emulated links, each acting as a transparent 
bridge between the endpoints. In Emulab, the shaping 
nodes’ Dummynet is configured with one or more pipes 
to handle traffic in each direction on the emulated link, 
allowing for asymmetric link characteristics. Shaping 
nodes can also be used in LAN topologies by placing a 
shaping node between each node and switch implement- 
ing the LAN. Thus traffic between any two nodes passes 
through two shaping nodes: one between the source and 
the LAN, and one between the LAN and the destination. 


3.2 Enhancements for Path Emulation 


To turn Dummynet into a path emulator, we made a num- 
ber of enhancements to it. The parameters to the result- 
ing path emulator are summarized in Table 1. 

Capacity and Available Bandwidth. Dummynet
implements bandwidth shaping in terms of a bandwidth
pipe, which contains a bandwidth queue that is drained at
a specified rate, modeling some capacity C. To separate
the emulation of capacity from available bandwidth, we
modified Dummynet to insert “placeholder” packets into
the bandwidth queues at regular, configurable intervals.
These placeholder packets are neither received from nor
sent to an actual network interface; their purpose is simply
to adjust the rate at which foreground flows’ packets
move through the queue. The placeholders are sent at a
constant bit rate of C - ABW, setting the bandwidth
available to the experimenter’s foreground flows. ABW can
be set as a function of offered load, using the mechanism
described below.
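
The relationship between the placeholder rate and the insertion interval can be sketched as follows. The placeholder size of 1500 bytes is an assumption made for illustration; the text does not state what size the modified Dummynet uses, and the function name is hypothetical.

```python
def placeholder_interval_s(capacity_bps, abw_bps, placeholder_bytes=1500):
    """Seconds between placeholder packets so that they consume C - ABW."""
    if abw_bps > capacity_bps:
        raise ValueError("ABW cannot exceed capacity")
    cross_traffic_bps = capacity_bps - abw_bps
    if cross_traffic_bps == 0:
        return None  # C = ABW: no placeholders are needed
    return placeholder_bytes * 8 / cross_traffic_bps


# A 43 Mbps bottleneck with 4.3 Mbps left for foreground flows.
interval = placeholder_interval_s(43e6, 4.3e6)
print(f"insert one placeholder every {interval * 1e6:.0f} microseconds")
```

Changing ABW at run time then amounts to recomputing this interval, which is how a reactivity model could adjust the bandwidth seen by foreground flows without touching the capacity.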

Base Delay. We leave Dummynet’s mechanism for 
emulating the constant base delay unchanged. Packets 
pass through “delay” queues, where they remain for a 
fixed amount of time. 

Queue Size. We use Equation 4 to set the queue size 
for the bandwidth queues in each direction of each path, 
dividing the number of bytes equally between the for- 
ward and reverse directions. Because the model assumes 
that packets are dropped almost exclusively by the bottle- 
neck router, modeled by the bandwidth queues, the size 
of the delay queues is effectively infinite. 

Background Traffic Reactivity. We implement the 
ABW_f and ABW_r functions as a set of tables that are parameters
to the emulator. Each path is associated with a
distinct table in each direction. We measure the offered 
load on a path by counting the number of foreground 
flows traversing that path. We do this for two reasons. 
First, it makes the measurement problem more tractable, 
allowing us to measure a relatively small, discrete set of 
possible offered loads on the real path. Second, our goal 
is to recreate inside the emulator the behavior that one 
would see by sending the same flows on the real network. 
The complex feedback system created by the interaction 
of foreground flows with background flows is captured 
most simply by measuring entire flows, as it is strongly 
related to TCP dynamics. It does have a downside, how- 
ever, in that it makes the assumption that the foreground 
flows will be full-speed TCP flows. During an execution 
of the emulator, a traffic monitor counts the number of 
active foreground flows on each path, and informs the 
emulator which table entry to use to set the aggregate 
ABW for the path. This target ABW is achieved inside 
the emulator by adjusting the rate at which placeholder 
packets enter the bandwidth queue. Our implementation 
also readjusts bottleneck queue sizes in reaction to these 
changes in available bandwidth. 
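
The table-driven mechanism described above can be sketched as follows. The table layout, the function names, and the policy of clamping to the largest measured flow count are assumptions made for illustration; the text specifies only that a flow count selects a table entry, which in turn sets the placeholder rate.

```python
from typing import Dict

# Hypothetical table for one direction of one path:
# number of active foreground flows -> aggregate ABW in bits per second.
ReactivityTable = Dict[int, float]


def lookup_abw(table: ReactivityTable, active_flows: int) -> float:
    """Return the aggregate ABW for the observed flow count.

    Uses the largest measured flow count that does not exceed the
    observed one (an assumed policy for unmeasured loads).
    """
    if active_flows <= 0:
        raise ValueError("need at least one active flow")
    usable = [n for n in table if n <= active_flows]
    key = max(usable) if usable else min(table)
    return table[key]


def placeholder_rate(capacity_bps: float, table: ReactivityTable,
                     active_flows: int) -> float:
    """Placeholder bit rate C - ABW for the current offered load."""
    return capacity_bps - lookup_abw(table, active_flows)


table = {1: 4.0e6, 2: 6.0e6, 3: 7.0e6}  # made-up values, not measurements
```

A traffic monitor would call `placeholder_rate` whenever the flow count on a path changes.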

Shared Bottlenecks. We implement shared bottle- 
necks by allowing a bandwidth pipe to shape traffic to 
more than one destination simultaneously. For each end- 
point host in the topology, the emulator takes as a pa- 
rameter a set of “equivalence classes”: sets of destina- 
tion hosts that share a common bottleneck, and thus a 
common bandwidth pipe. Packets are directed into the 
proper bandwidth pipe using IPFW rules. Our current
implementation only supports bottlenecks that share a 
common source. We are in the process of extending our 
prototype to implement other kinds of bottlenecks, such 
as those that share a destination. 
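
As a rough illustration of the equivalence-class parameter, the sketch below maps each destination of a single source to a pipe index. The function name and the rule that unlisted destinations get a pipe of their own are assumptions; the actual implementation directs packets with IPFW rules, which are not shown here.

```python
from typing import Dict, Iterable


def assign_pipes(equivalence_classes: Iterable[Iterable[str]],
                 all_destinations: Iterable[str]) -> Dict[str, int]:
    """Map each destination host to a bandwidth-pipe index for one source.

    Destinations in the same equivalence class share a pipe; any
    destination not listed in a class gets a pipe of its own.
    """
    pipe_of: Dict[str, int] = {}
    next_pipe = 0
    for cls in equivalence_classes:
        for dest in cls:
            if dest in pipe_of:
                raise ValueError(f"{dest} appears in two classes")
            pipe_of[dest] = next_pipe
        next_pipe += 1
    for dest in all_destinations:
        if dest not in pipe_of:
            pipe_of[dest] = next_pipe
            next_pipe += 1
    return pipe_of


# Hosts "b" and "c" share a bottleneck; "d" does not.
pipes = assign_pipes([["b", "c"]], ["b", "c", "d"])
```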


3.3 Gathering Data from the Real World


To create and run experiments with the path emulator, we 
need a source of input data for the parameters shown in 
Table 1. Although it is possible to synthesize values for 
these parameters, we concentrate here on gathering them 
from end-to-end measurements of the Internet. 

We developed a system for gathering data for these 
parameters using hosts in PlanetLab [17], which gives 
us a large number of end-site vantage points around the 
world. Each node in the emulation is paired with a 
PlanetLab node; measurements taken from the Planet- 
Lab node are used to configure the paths to and from the 
emulated node. 

To gather values for RTT_base, we use simple ping
packets, sent frequently over long periods of time [10]. 
The smallest RTT seen for a path is presumed to be an 
event in which the probe packet encountered no signif- 
icant queuing delay, and thus representative of the base 
RTT. 
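
The minimum-RTT heuristic amounts to a one-line reduction over the probe history. In this sketch the function name and the use of `None` for lost probes are assumptions.

```python
from typing import Iterable, Optional


def estimate_base_rtt(rtt_samples_ms: Iterable[Optional[float]]) -> float:
    """Estimate base RTT as the smallest RTT seen on a path.

    None entries stand for lost probes (an assumed encoding) and are skipped.
    """
    samples = [s for s in rtt_samples_ms if s is not None]
    if not samples:
        raise ValueError("no successful probes on this path")
    return min(samples)


base = estimate_base_rtt([53.1, 50.2, None, 61.0])
```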

To detect shared bottlenecks from a source to a set of 
destinations, we make use of a wavelet-based conges- 
tion detection tool [12]. This tool sends UDP probes 
from a source node to all destination nodes of inter- 
est and records the variations in one-way delays expe- 
rienced by the probe packets. Random noise introduced 
in the delays by non-bottleneck links is removed using 
a wavelet-based noise-removal technique. The paths are 
then grouped into different clusters, with all the paths 
from the source to the set of destinations going through 
the same shared bottleneck appearing in a single clus- 
ter. The shared bottlenecks found by this procedure are 
passed to the emulator as the S,<p sets. 

Our goal is that a TCP flow through the emulator 
should achieve the same throughput as a TCP flow sent 
along the real path. So, we use a definition of ABW that 
differs slightly from the standard one—we equate the 
available bandwidth on a path to the throughput achieved 
by a TCP flow. We also need to measure how this ABW 
changes in response to differing levels of foreground traf- 
fic. While we cannot observe the background traffic on 
the bottleneck directly, we can observe how different 
levels of foreground traffic result in different amounts 
of bandwidth available to that foreground traffic. Although
packet-pair and packet-train [9, 13, 20] measurement
tools are efficient, they do not elicit reactions from
background traffic. For this reason, we use the follow- 
ing methodology to concurrently estimate the ABW and 
reactivity of background traffic on a particular path.

To measure the reactivity of Internet cross-traffic to the 
foreground flows, we run a series of tests using iperf 
between each pair of PlanetLab nodes, with the number 
of concurrent flows ranging from one to ten. We use the 
values obtained from these tests between all paths of in- 
terest to build the reactivity tables for the path emulator. 
However, running such a test takes time: only one test 
can be active on each path at a time, and iperf must run 
long enough to reach a steady state. Thus, our measure- 
ments necessarily represent a large number of snapshots 
taken at different times, rather than a consistent snapshot 
taken at a single time. The cross-traffic on the bottleneck 
may vary significantly during this time frame. So, the 
reactivity numbers are an approximation of the behavior 
of cross-traffic at the bottleneck link. This is a general 
problem with measurements that must perturb the envi- 
ronment to differing levels. The time required to gather 
these measurements is also the main factor limiting the 
scale of our emulations. 

Another problem that arises is the proper ABW value 
for shared bottlenecks. Because paths that share a bottle- 
neck do not necessarily have the same RTTs, they may 
evoke different levels of response from reactive back- 
ground traffic. It is not feasible to measure every possible 
combination of flows on different paths through the same 
shared bottleneck. Thus, we use the approximation dis- 
cussed in Section 2.3 to set ABW for shared bottlenecks. 

Our current implementation does not measure the bottleneck
link capacities C_f and C_r on PlanetLab paths, due
to the difficulty of obtaining accurate packet timings on 
heavily loaded PlanetLab nodes [25]. We set the capac- 
ity of all bottleneck links to 100 Mbps. In practice, we 
find that for C >> ABW, the exact value of C makes little 
difference on the emulation, and thus we typically set it 
to a fixed value. We demonstrate this in Section 4.3. 


4 Evaluation 


The goal of our evaluation is to show that our path emu- 
lator accurately reproduces measurements taken from In- 
ternet paths. We demonstrate, using micro-benchmarks 
and a real application, that our path emulator meets this 
goal under conditions in which approximating the path 
using a single link emulator fails to do so. In the ex- 
periments described below, we concentrate on accurately 
reproducing TCP throughput and observed RTT. 

All of our experiments were run in Emulab on PCs 
with 3 GHz Pentium IV processors and 2GB of RAM. 
The nodes running application traffic used the Fedora 
Core Linux distribution with a 2.6.12 kernel, with its de- 
fault BIC-TCP implementation. The link emulator was 
Dummynet running in the FreeBSD 5.4 kernel, and our 
path emulator is a set of modifications to it. All measurements
of Internet paths were taken using PlanetLab
hosts.


4.1 Effect on TCP Throughput 


We begin by running a micro-benchmark, iperf, a bulk- 
transfer tool that simply tries to achieve as much through- 
put as possible using a single TCP flow. 

We performed a series of experiments to compare the 
behavior of iperf when run on real Internet paths, an 
unmodified Dummynet link emulator, and our path em- 
ulator. We used a range of ABW and RTT values, some 
taken from measurements on PlanetLab and some syn- 
thetic. The ABW values from PlanetLab were measured 
using iperf, and thus the emulators’ accuracy can be 
judged by how closely iperf’s performance in the em- 
ulated environment matches the ABW parameter. In the 
link emulator, we set the capacity to the desired ABW (as 
there is only one bandwidth parameter), and in the path 
emulator, we set capacity to 100 Mbps. The link emu- 
lator uses Dummynet’s default queue size of 73 KB, and 
the path emulator’s queue size was set using Equation 4. 
Reactivity tables and shared bottlenecks were not used 
for these experiments. We started two TCP flows simul- 
taneously on the emulated path, one in each direction, 
and report the mean of five 60-second runs. 

The results of these experiments are shown in Table 2. 
It is clear from the percent errors that the path emulation 
achieves higher accuracy than the link emulator in many 
scenarios. While both achieve within 10% of the speci- 
fied throughput in the first test (a low-bandwidth, sym- 
metric path), as path asymmetry and bandwidth-delay 
product increase, the effects discussed in Section 2.2 
cause errors in the link emulator. While our path em- 
ulator remains within approximately 10% of the target 
ABW, the link emulator diverges by as much as 66%. 
The forward direction, with its higher throughput, tends 
to suffer disproportionately higher error rates. Because 
the measured values come from real Internet paths, they 
do not represent unusual or extreme conditions. 

The first two rows of synthetic results demonstrate 
that, even in cases of symmetric bandwidth, the failure to 
differentiate between capacity and available bandwidth 
hurts the link emulator’s accuracy. The third demon- 
strates divergence under highly asymmetric conditions. 

To evaluate the importance of selecting proper queue 
sizes, we reran two earlier experiments in our path em- 
ulator, this time setting the queue sizes greater than the 
upper limits allowed by Equation 4. These results are 
shown in the bottommost section of Table 2 (labeled 
“Bad Queue Size”). The RTT for each flow grows until 
the flows reach their maximum window sizes, preventing 
them from utilizing the full ABW of the emulated path 
and resulting in large errors. 

Table 2: Throughput achieved by simultaneous TCP flows along both directions of a number of paths, using a link emulator and using our path emulator.


Figure 5: RTT over the lifetime of a 30-second TCP flow. Note that the range of the Y-axis in the center graph is seven times larger than that of the other two graphs.


4.2 Effect on Round-Trip Time 


In addition to TCP throughput, our path emulator also 
has a significant effect on the RTT observed by a flow, 
producing RTTs much more similar to those on real paths 
than those seen in a simple link emulator. To evaluate this 
difference, we measured the path between the PlanetLab 
nodes at Harvard and those at Washington University in 
St. Louis (WUSTL). The ABW was 409 Kbps from Har- 
vard to WUSTL, and 4,530 Kbps from WUSTL to Har- 
vard. The base RTT was 50 ms. To isolate the effects of 
distinguishing ABW and capacity from other differences 
between the emulators, we set the queue size in both to 
the same value (our Linux kernel’s maximum window 
size of 32 KB), and exercised only one direction of the 
path. 

Figure 5 shows the round-trip times seen during a 30-
second iperf run from Harvard to WUSTL, and the 
round-trip times seen under both link and path emulation. 
Both emulators achieved the target bandwidth, but dra- 
matically differ in the round-trip times and packet-loss 
characteristics of the flows. Figure 5(b) and Figure 5(c) 
show the round-trip times observed on the link and path 
emulators, respectively. As TCP tends to keep the bottleneck
queue full, the RTT quickly plateaus at the length of the
queue in time. Because the link emulator’s queue drains 
at the rate of ABW, rather than the much larger rate of C, 
packets spend much longer in the queue in the link emu- 
lator. The average RTT for the link emulator was 629 ms, 
an order of magnitude higher than the average RTT of 
53.1 ms observed on the actual path (Figure 5(a)). Because
the path emulator separates capacity and ABW, it
gives an average RTT of 53.2 ms, which is within 1% of
the value on the real path. The standard deviation inside
of the path emulator is 3.0 ms, somewhat lower than the
5.1 ms seen on the real path.

Figure 6: Experiments on an emulated path with 6.5 Mbps
available bandwidth in the forward direction. A constant queue
size is maintained while capacity is varied.

To get comparable RTTs from the link emulator, its 
queue would have to be much smaller, around 2.5 KB, 
which is not large enough to hold two full-size TCP 
packets. We reran this experiment in the link emulator 
using this smaller queue size, and a unidirectional TCP 
flow was able to achieve close to the target 409 Kbps 
throughput. However, when we ran bidirectional flows, 
the flow along the reverse direction was only able to 
achieve a throughput of around 200 Kbps, despite the fact 
that the ABW in that direction was set to 4,530 Kbps (the 
value measured on the real path). This demonstrates that 
adjusting queue size by itself is not sufficient to fix ex- 
cessive RTTs, as it can cause significant errors in ABW 
emulation. 
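
The RTT gap between the two emulators in this experiment can be approximated from how long a full queue takes to drain. The sketch below is back-of-the-envelope arithmetic under stated assumptions: 32 KB is read as 32 × 1024 bytes, 409 Kbps as 409,000 bits per second, the path emulator's capacity is taken to be the 100 Mbps used elsewhere in the evaluation, and the queue is treated as permanently full.

```python
def drain_time_ms(queue_bytes: int, drain_rate_bps: float) -> float:
    """Time for a full queue to drain at the given rate, in milliseconds."""
    return queue_bytes * 8 / drain_rate_bps * 1000.0


QUEUE_BYTES = 32 * 1024   # queue size used in both emulators
BASE_RTT_MS = 50.0        # measured base RTT of the path

# Link emulator: the queue drains at the ABW of 409 Kbps.
link_rtt_ms = BASE_RTT_MS + drain_time_ms(QUEUE_BYTES, 409e3)

# Path emulator: the queue drains at the capacity C (assumed 100 Mbps).
path_rtt_ms = BASE_RTT_MS + drain_time_ms(QUEUE_BYTES, 100e6)
```

This full-queue bound comes to roughly 690 ms for the link emulator and roughly 53 ms for the path emulator, the same order as the reported averages of 629 ms and 53.2 ms.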


4.3 Sensitivity Analysis of Capacity 


As we saw in Figure 1, once the capacity has grown sufficiently
large, it is possible to satisfy both the upper and
lower bounds on queue size. Our next experiment tests 
how sensitive the emulator is to capacity values larger 
than this intersection point. 

We ran several trials with a fixed available bandwidth 
(6.5 Mbps) but varying levels of capacity. All other pa- 
rameters were left constant. Figure 6 shows the relative 
error in achievable throughput as we vary the capacity. 
While error peaks when capacity is very near available 
bandwidth, outside of that range, changing the capacity 
has very little effect on the emulation. This justifies the 
decision in our implementation to use a fixed, large ca- 
pacity, rather than measuring it for each path. 

Figure 7: Time taken by participants in a BitTorrent swarm to
download a file. Download times are shown for each node for 
both path and simple link emulation. 


4.4 BitTorrent Application Results 


We demonstrated in Section 4.1 that using path param- 
eters in a simple link emulator causes artifacts in many 
situations. We now show that these artifacts cause inac- 
curacies when running real applications and are not just 
revealed using measurement traffic. Though this experi- 
ment uses multiple paths, to isolate the effects of capacity 
and queue size, it does not model shared bottlenecks or 
reactivity. 

Figure 7 shows the download times of a group of Bit- 
Torrent clients using simple link emulation and path em- 
ulation with the same parameters, which were gathered 
from PlanetLab paths. Each pair of bars shows the time 
taken to download a fixed file on one of the twelve nodes. 
The simple link emulator limits available bandwidth in- 
accurately under some circumstances, which increases 
the download duration on many of the nodes. As seen in 
the figure, nodes download an average of 6% more slowly
in the link emulator than under the path emulator.
The largest difference is 12%. This shows that the
artifacts we observe with micro-benchmarks also affect 
the behavior of real applications. 


4.5 Network Reactivity 


Our next experiment examines the fidelity of our reactiv- 
ity model. We ran reactivity tests on a set of thirty paths 
between PlanetLab nodes. For each path, we measured 
aggregate available bandwidth with a varying number of 
foreground iperf flows, ranging from one to eight. We 
used this data as input to our emulator, in the form of 
reactivity tables, then repeated the experiments inside of 
the emulator. In this experiment, the paths are tested in- 
dependently at different times, so no shared bottlenecks 
are exercised. 

Figure 8: A CDF showing percentage of error over paths with
multiple foreground flows. 


By comparing the throughputs achieved inside of the 
emulator to those obtained on the real path, we can test 
the accuracy of our reactivity model. Figure 8 shows the 
results of this experiment. For each trial (a specific num- 
ber of foreground flows over a specific path), we com- 
puted the error as the percentage difference between the 
aggregate bandwidth measured on PlanetLab and that 
recreated inside the emulator. Our emulator was quite 
accurate; 80% of paths were emulated to within 20% of 
the target bandwidth. 
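
The error metric and the summary statistic can be sketched as below. Normalizing by the measured value is an assumption (the text says only "percentage difference"), and the trial values are invented for illustration rather than taken from the experiment.

```python
from typing import Iterable


def percent_error(measured: float, emulated: float) -> float:
    """Absolute difference as a percentage of the measured value."""
    return abs(emulated - measured) / measured * 100.0


def fraction_within(errors: Iterable[float], threshold_pct: float) -> float:
    """Fraction of trials whose error is at most threshold_pct."""
    errors = list(errors)
    return sum(e <= threshold_pct for e in errors) / len(errors)


# (measured Mbps, emulated Mbps) pairs; made-up values, not real data.
trials = [(10.0, 9.0), (5.0, 5.5), (8.0, 4.0), (2.0, 2.1), (20.0, 19.0)]
errors = [percent_error(m, e) for m, e in trials]
share = fraction_within(errors, 20.0)
```

Sorting `errors` and plotting their cumulative fraction would give a CDF of the kind shown in Figure 8.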


There are some outliers, however, with significant er- 
ror. These point to limitations of our implementation, 
which currently sets capacities to 100 Mbps and has a 
1 MB limit on the bottleneck queue size. Some paths in 
this experiment had very high ABW: as high as 78 Mbps 
in aggregate for eight foreground flows. As we saw in 
Figure 6, when ABW is close to capacity, significant er- 
rors can result. With high bandwidths and multiple flows, 
the lower bound on queue sizes (Equation 2) also be- 
comes quite large, producing two sources of error. First, 
if this bound becomes larger than our 1 MB implemen- 
tation limit, we are unable to provide sufficient queue 
space for all flows to achieve full throughput. Second, 
our limits on capacity limit the amount we can adjust the 
upper bound on queue size, Equation 4, meaning that we 
may end up in a situation where it is not possible to sat- 
isfy both the upper and lower bounds. 


It would be possible to raise these limits in our im- 
plementation by improving bandwidth shaping efficiency 
and allowing larger queue sizes. The underlying issues 
are fundamental ones, however, and would reappear at 
higher bandwidths: our emulator requires that capacity 
be significantly larger than the available bandwidth to be 
emulated, and providing emulation for large numbers of 
flows with high ABW requires large queues. 


NSDI ’09: 6th USENIX Symposium on Networked Systems Design and Implementation 


[Figure 9 plot: "CDF of Bottleneck Bandwidths". X-axis: Bandwidth (Mbps), 0 to 16. Y-axis: Cumulative Fraction of Paths. Series: Simple Link Emulator, Path Emulator.]

Figure 9: A CDF showing bandwidth achieved at shared bottlenecks.


4.6 Shared Bottlenecks 


Finally, we examine the effects of shared bottlenecks on 
bandwidth. We again measured the paths between a set 
of PlanetLab nodes, finding their bandwidth, reactivity, 
and shared bottlenecks. After characterizing the paths in
the real world, we configured two emulations. The first 
is a simple link emulation, approximating each path with 
an independent link emulator. The second uses our full 
path emulator, including its modeling of shared bottle- 
necks and reactivity. In order to stress and measure the 
system, we simultaneously ran an instance of iperf in 
both directions between every pair of nodes. This causes 
competition on the shared bottlenecks and also ensures 
that every path is being exercised in both directions at 
the same time. 


Figure 9 shows a CDF of the bandwidth achieved at 
the bottlenecks in both the link emulator and our path 
emulator, demonstrating that failure to model shared bot- 
tlenecks results in higher bandwidth. To isolate the ef- 
fects of shared bottlenecks and reactivity, only flows 
passing through those bottlenecks are shown. In the link 
emulator, each flow receives the full bandwidth mea- 
sured for the path. In the path emulator, flows pass- 
ing through shared bottlenecks are forced to compete 
for this bandwidth, and as a result, each receives less 
of it. Modeling of reactivity plays an important role 
here: in the path emulator, each shared bottleneck is 
being exercised by multiple flows, and thus the aggre- 
gate bandwidth available is affected by the response of 
the cross-traffic. The few cases in which the path em- 
ulator achieves higher bandwidth than the link emulator 
are caused by highly asymmetric paths, where the effects 
demonstrated in Section 4.1 dominate. 


5 Related Work


There is a large body of work on measuring the Internet 
and characterizing its paths. The focus of our work is not 
to create novel measurement techniques, but to create ac- 
curate emulations based on existing techniques. Our con- 
tribution lies in the identification of principles that can be 
used to accurately emulate paths, given these measure- 
ments. 


Our emulator builds on the Emulab [32] and Dum- 
mynet [22] link emulators to reproduce measured end- 
to-end path characteristics. ModelNet [27] also emulates 
router-level topologies on a link-by-link basis. Capacity 
and delay are set for each link on the path. To create 
shared bottlenecks with a certain degree of reactivity, it 
is up to the experimenter to carefully craft a router topol- 
ogy and introduce cross-traffic on a particular link of the 
path. ModelNet includes tools for simplifying router- 
level topologies, but does not abstract them as heavily 
as we do in this work. NIST Net [7], a Linux-based net- 
work emulator, is an alternative to Dummynet. However, 
it is also a link emulator and does not distinguish be- 
tween capacity and available bandwidth. Our model ab- 
stracts the important characteristics of the path, thereby 
simplifying their specification and faithfully reproducing 
those network conditions without the need for a detailed 
router-level topology. 


Appenzeller et al. [1] show that the queuing buffer re- 
quirements for a router can be reduced provided that a 
large number of TCP flows are passing through the router 
and they are desynchronized. They also provide reason- 
ing as to why setting the queue sizes to the bandwidth- 
delay product works for a reasonably small number of 
TCP flows. We use the bandwidth-delay product as the
lower limit on the queue sizes of the paths being mod- 
eled. We are also concerned about low capacity links 
(asymmetric or otherwise) causing large queuing delays 
that adversely affect the throughput of TCP. Our model 
separates capacity from available bandwidth and deter- 
mines queue sizes such that the TCP flows on the path 
do not become window-size limited. 


Researchers have investigated the effects of capacity
and available bandwidth asymmetry on TCP performance
[2, 3, 11, 14]. They proposed modifications to either
the bottleneck router forwarding mechanism, or the
end node TCP stack. We do not seek to minimize the 
queue sizes at the router, but rather to calculate the right 
queue size for a path to enable the foreground TCP flows 
to fully utilize the ABW during emulation. We mod- 
ify neither router forwarding nor the TCP stack and our 
model is independent of the TCP implementation used 
on the end nodes. 

Harpoon [24], Swing [28], and Tmix [31] are frameworks
that characterize the traffic passing through a link
and then generate statistically similar traffic for emulating
that link or providing realistic workloads. Our
work, in contrast, does not seek to characterize or re- 
create background traffic in great detail. We characterize 
cross-traffic at a much higher level, solely in terms of 
its reactivity to foreground flows. We are able to do this 
characterization with end-to-end measurements, and do 
not need to directly observe the packets comprising the 
cross-traffic. 


6 Conclusion and Future Work 


We have presented and evaluated a new path emulator 
that can accurately recreate the observed end-to-end con- 
ditions of Internet paths. The path model within our 
emulator is based on four principles that combine to 
enable accurate emulation over a wide range of condi- 
tions. We have compared our approach to two alterna- 
tives that make use of simple link emulation. Unlike 
router-level emulation of paths, our approach is suitable 
for reconstructing real paths solely from measurements 
taken from the edges of a network. As we have shown, 
using a single link emulator to approximate a measured 
multi-hop path can fail to produce accurate results. Our 
path model corrects these problems, enabling recreations 
of real paths in the repeatable, controlled environment of 
an emulator. 

Much of our future work will concentrate on improv- 
ing the reactivity portion of our model. Our method of 
measuring reactivity is currently the most intensive part 
of our data gathering: it uses the most bandwidth, and 
takes the most time. Improving it will allow our system 
to run at larger scale. Viewing ABW as a function of 
the number of full-speed foreground TCP flows limits us 
both to TCP and to applications that are able to fill their 
network paths. In future refinements of our design, we 
hope to characterize ABW in terms of lower-level met- 
rics that are not intrinsically linked to TCP’s congestion 
control behavior. Finally, our averaging of ABW values 
for paths that share a bottleneck could use more study 
and validation. 

Another future direction will be the expansion of our 
work to the simulation domain. Simulators handle links 
and paths in much the same way as do emulators, and the 
model we describe in Section 2 can be directly applied to 
them as well. 


Abstract


MODIST is the first model checker designed for transparently 
checking unmodified distributed systems running on unmod- 
ified operating systems. It achieves this transparency via a 
novel architecture: a thin interposition layer exposes all ac- 
tions in a distributed system and a centralized, OS-independent 
model checking engine explores these actions systematically. 
We made MODIST practical through three techniques: an ex- 
ecution engine to simulate consistent, deterministic executions 
and failures; a virtual clock mechanism to avoid false positives 
and false negatives; and a state exploration framework to incor- 
porate heuristics for efficient error detection. 

We implemented MODIST on Windows and applied it to 
three well-tested distributed systems: Berkeley DB, a widely 
used open source database; MPS, a deployed Paxos implemen- 
tation; and PACIFICA, a primary-backup replication protocol 
implementation. MODIST found 35 bugs in total. Most importantly,
it found protocol-level bugs (i.e., flaws in the core
distributed protocols) in every system checked: 10 in total, including
2 in Berkeley DB, 2 in MPS, and 6 in PACIFICA.


1 Introduction 


Despite their growing popularity and importance, dis- 
tributed systems remain difficult to get right. These sys- 
tems have to cope with a practically infinite number of 
network conditions and failures, resulting in complex 
protocols and even more complex implementations. This 
complexity often leads to corner-case errors that are dif- 
ficult to test, and, once detected in the field, impossible 
to reproduce. 

Model checking has been shown effective at detect- 
ing subtle bugs in real distributed system implementa- 
tions [19, 27]. These tools systematically enumerate the 
possible execution paths of a distributed system by start- 
ing from an initial state and repeatedly performing all 
possible actions to this state and its successors. This 
state-space exploration makes rare actions such as net- 
work failures appear as often as common ones, thereby 
quickly driving the target system (i.e., the system we
check) into corner cases where subtle bugs surface. 
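
The enumeration described above can be illustrated with a toy explorer. This is a generic breadth-first sketch, not MODIST's engine: the `explore` function, the register example, and its invariant are all hypothetical.

```python
from collections import deque
from typing import Callable, Hashable, Iterable, List, Optional, Tuple

State = Hashable
Action = Tuple[str, State]  # (label, successor state)


def explore(initial: State,
            enabled: Callable[[State], Iterable[Action]],
            invariant: Callable[[State], bool]) -> Optional[List[str]]:
    """Breadth-first search over all reachable states.

    Returns the action labels leading to the first state that violates
    the invariant, or None if every reachable state satisfies it.
    """
    queue = deque([(initial, [])])
    seen = {initial}
    while queue:
        state, trace = queue.popleft()
        if not invariant(state):
            return trace
        for label, nxt in enabled(state):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, trace + [label]))
    return None


# Toy target: a register that receives "set 1" and "set 2" in any order
# and may crash at any point. State: (pending messages, value, crashed).
def enabled(state):
    pending, value, crashed = state
    if crashed:
        return []
    moves = [(f"deliver {m}", (pending - {m}, m, False))
             for m in sorted(pending)]
    moves.append(("crash", (pending, value, True)))
    return moves


# Hypothetical property: once all messages are delivered, the value is 2.
def invariant(state):
    pending, value, crashed = state
    return bool(pending) or crashed or value == 2


trace = explore((frozenset({1, 2}), 0, False), enabled, invariant)
```

Because reordering and crashes are explored as readily as the common delivery order, the search reaches the reordered delivery that breaks the property.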

To make model checking effective, it is crucial to expose
the actions a distributed system can perform and do
so at an appropriate level. Previous model checkers for
distributed systems tended to place this burden on users,
who have to either write (or rewrite) their systems in a
restricted language that explicitly annotates event handlers
[19], or heavily modify their system to shoehorn it
into a model checker [27].

This paper presents MODIST, a system that checks un- 
modified distributed systems running on unmodified op- 
erating systems. It simulates a variety of network con- 
ditions and failures such as message reordering, network 
partitions, and machine crashes. The effort required to 
start checking a distributed system is simply to provide 
a simple configuration file specifying how to start the 
distributed system. MODIST spawns this system in the 
native environment the system runs within, infers what 
actions the system can do by transparently interposing 
between the application and the operating system (OS), 
and systematically explores these actions with a cen- 
tralized, OS-independent model checking engine. We 
have carefully engineered MODIST to ensure the exe- 
cutions MODIST explores and the failures it injects are 
consistent and deterministic: inconsistency creates false 
positives that are painful to diagnose; non-determinism 
makes it hard to reproduce detected errors. 

Real distributed systems tend to rely on timeouts for 
failure detection (e.g., leases [14]); many of these timeouts
hide in branch statements (e.g., “if (now > t +
timeout)”). To find bugs in the rarely tested timeout
handling code, MODIST provides a virtual clock mech- 
anism to explore timeouts systematically using a novel 
static symbolic analysis technique. Compared to the 
state-of-the-art symbolic analysis techniques [3, 4, 13, 
31], our method reduces analysis complexity using the 
following two insights: (1) programmers use time val- 
ues in simple ways (e.g., arithmetic operations) and (2) 
programmers check timeouts soon after they query the 
current time (e.g., by calling gettimeofday()).
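
To show why a checker-controlled clock helps reach timeout code, the sketch below runs a timeout branch once on each side of its deadline. It is a simplified illustration, not MODIST's static symbolic analysis; the `VirtualClock` class and the lease example are hypothetical.

```python
class VirtualClock:
    """Clock whose reading is chosen by the checker, not the wall clock."""

    def __init__(self, start: float = 0.0):
        self._now = start

    def now(self) -> float:
        return self._now

    def advance_to(self, t: float) -> None:
        # Time must never run backwards, or the execution is inconsistent.
        if t < self._now:
            raise ValueError("virtual time cannot go backwards")
        self._now = t


def lease_expired(clock: VirtualClock, granted_at: float,
                  timeout: float) -> bool:
    """The kind of branch the text describes: if (now > t + timeout)."""
    return clock.now() > granted_at + timeout


def explore_timeout(granted_at: float, timeout: float) -> dict:
    """Run the check once on each side of the deadline."""
    outcomes = {}
    for label, t in (("before", granted_at + timeout),
                     ("after", granted_at + timeout + 1.0)):
        clock = VirtualClock(start=granted_at)
        clock.advance_to(t)
        outcomes[label] = lease_expired(clock, granted_at, timeout)
    return outcomes


outcomes = explore_timeout(granted_at=10.0, timeout=5.0)
```

Both branch outcomes are reached without waiting for real time to pass, which is also why a virtual clock can shorten a checked execution.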

We implemented MODIST on Windows. We applied 
it to three well-tested distributed systems: Berkeley DB, 
a widely used open-source database; MPS, a Paxos im- 
plementation that has managed production data centers 
with more than 100K machines for over two years; and
PACIFICA, a primary-backup replication protocol imple- 
mentation. MODIST found 35 bugs in total. In particular, 
it found protocol-level bugs (i.e., flaws in the core protocols)
in every system checked: 10 in total, including 2 in
Berkeley DB, 2 in MPS, and 6 in PACIFICA. We mea- 
sured the speed of MODIST and found that (1) MODIST 
incurs reasonable overhead (up to 56.5%) as a checking 
tool and (2) it can speed up a checked execution (up to
216 times faster) using its virtual clock. 

MODIST provides a customizable framework for in- 
corporating various state-space exploration strategies. 
Using this framework, we implemented dynamic partial 
order reduction (DPOR) [9], random exploration, depth- 
first exploration, and their variations. Among these, 
DPOR is a strategy well-known in the model checking 
community for avoiding redundancy in exploration. To 
evaluate these strategies, we measured their protocol- 
level coverage (i.e., unique protocol states explored). 
The results show that, while DPOR achieves good cov- 
erage for a small bounded state space, it scales poorly 
as the state space grows; a more balanced variation of 
DPOR, with a set of randomly selected paths as starting 
points, achieves the best coverage. 

This paper is organized as follows. We present an 
overview of MODIST (§2), then describe its implementation
(§3) and evaluation (§4). Next we discuss related
work (§5) and conclude (§6).


2 Overview 


A typical distributed system that MODIST checks has 
multiple processes,* each running multiple threads. 
These processes communicate with each other by send- 
ing and receiving messages through socket connections. 
MODIST can re-order messages and inject failures to 
simulate an asynchronous and unreliable network. The 
processes may write data to disk, and MODIST will 
generate different possible crash scenarios by permuting 
these disk writes. 

The remainder of this section gives an overview of MODIST,
covering its architecture (§2.1), its checking process
(§2.2), the checks it enables (§2.3), and its user
interface (§2.4).


2.1 Architecture 


Figure 1 illustrates the architecture of MODIST applied to a
4-node distributed system. The master node runs multiple
threads (the curved lines in the figure) and might send or
receive messages (the solid boxes). For each process in the
target system, MODIST inserts an interposition frontend
between the process and its native operating system to
intercept and control non-deterministic decisions involving
thread and network operations. MODIST further employs a
backend that runs in a different address space and
communicates with the frontends via RPC. This design
minimizes MODIST’s perturbation of the target system,
allowing us to build a generic backend that runs on a
POSIX-compliant operating system, and makes it possible to
build the frontends for MODIST on different operating
systems. The backend consists of five components: a
dependency tracker, a failure simulator, a virtual clock
manager, a model checking engine, and a global assertion
checker.


*In this paper we use node and process interchangeably.


Figure 1: MODIST architecture. All MODIST components are
shaded. The target system consists of one master, two
replication nodes, and one client. MODIST’s frontend
interposes between each process in the target system and the
operating system to intercept and control non-deterministic
actions, such as message interleaving and thread
interleaving. MODIST’s backend runs in a separate address
space to schedule these actions.


Interposition. MODIST’s interposition frontend is a 
thin layer that exposes what actions a distributed sys- 
tem can do and lets MODIST’s backend deterministically 
schedule them. Specifically, it does so in two steps: (1) 
when the target system is about to execute an action, 
the frontend pauses it and reports it to the backend; and 
(2) upon the backend’s command, the frontend either re- 
sumes or fails the paused action, turning the target sys- 
tem into a “puppet” of the backend. 

We place the interposition layer at the OS-application 
boundary to avoid modifying either the target system or 
the underlying operating system. In addition, despite 
variations in OS-application interfaces, they provide sim- 
ilar functions, allowing us to build a generic backend. 

Since the interposition layer runs inside the target sys- 
tem, we explicitly design it to be simple and mostly state- 
less, and leave the logic and the state in the backend, 
thereby reducing the perturbation of the target system. 


Dependency Tracking. MODIST’s dependency tracker oversees
how actions interfere with each other. It uses these
dependencies to compute the set of enabled actions, i.e.,
the actions that, if executed, will not block in the OS. For
example, a recv() is enabled if there is a message to
receive, and disabled otherwise. The model checking engine
(described below) only schedules enabled actions, because
scheduling a disabled action will deadlock the target system
(analogous to a cooperative thread scheduler scheduling a
blocked thread).
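The enabled-set idea can be illustrated with a small sketch.
This is a toy Python model written for this discussion, not
MODIST’s implementation: the class DependencyTracker and its
methods (on_recv_issued, on_send, on_recv_done, enabled) are
hypothetical names, and the model assumes that any send on a
connection wakes every receiver waiting on it.

```python
from collections import defaultdict, deque


class DependencyTracker:
    """Toy model: tracks unread bytes per connection and blocked receivers."""

    def __init__(self):
        self.bytes_in_flight = defaultdict(int)  # connection -> unread bytes
        self.wait_queue = defaultdict(deque)     # connection -> waiting threads
        self.disabled = set()                    # threads that would block

    def on_recv_issued(self, thread, conn):
        # A receive with nothing to read would block in the OS: disable it.
        if self.bytes_in_flight[conn] == 0:
            self.disabled.add(thread)
            self.wait_queue[conn].append(thread)

    def on_send(self, conn, nbytes):
        # Data is now available, so waiting receivers become enabled again
        # (simplifying assumption: every waiter on the connection is woken).
        self.bytes_in_flight[conn] += nbytes
        while self.wait_queue[conn]:
            self.disabled.discard(self.wait_queue[conn].popleft())

    def on_recv_done(self, conn, nbytes):
        # Bytes consumed by a completed receive leave the connection.
        self.bytes_in_flight[conn] -= nbytes

    def enabled(self, threads):
        # Only these threads may be scheduled without risking a deadlock.
        return [t for t in threads if t not in self.disabled]
```

A scheduler built on such a tracker would pick its next
action only from enabled(...), which is the property the
paragraph above relies on.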


Failure Simulation. MODIST’s failure simulator orchestrates
many rare events that may occur in a distributed system,
including message reordering, message loss, network
partition, and machine crashes; these events can expose bugs
in the often-untested failure handling code. The failure
simulator lets MODIST inject these failures as needed,
consistently to avoid false positives, and deterministically
to let users reliably reproduce errors (cf. §2.2).


# command          working dir   inject failure?
master.exe         ./master/     1
node.exe           ./node1/      1
node.exe           ./node2/      1
client.exe test1   ./client/     0


Figure 2: A configuration file that spawns the distributed
system in Figure 1. We used this file to check PACIFICA.


Virtual Clock. MODIST’s virtual clock manager has 
two main functions: (1) to discover timers in the target 
system and fire them as requested by MODIST’s model 
checking engine to trigger more bugs, and (2) to ensure 
that all processes in the target system observe a consis- 
tent clock to avoid false positives. Since the clock is vir- 
tual, MODIST can “fast forward” the clock as needed, 
often making a checked execution faster than a real one. 


Model Checking. MODIST’s model checking engine 
acts as an “omnipresent” scheduler of the target system. 
It systematically explores a distributed system’s execu- 
tions by enumerating the actions, failures, and timers ex- 
posed by the other MODIST components. It uses a set 
of search heuristics and state-space reduction techniques 
to improve the efficiency of its exploration. We elaborate
on the model checking process in the next section and on the
search strategies in §3.6.


Global Assertion. MODIST’s global assertion mecha- 
nism lets users check distributed properties on consistent 
global snapshots; these properties cannot be checked by 
observing only the local states at each individual node. 
Its implementation leverages our previous work [25]. 


2.2 Checking Process 


With all MODIST’s components in place, we now de- 
scribe MODIST’s checking process. To begin checking a 
distributed system, the user only needs to prepare a sim- 
ple configuration file that specifies how to start the target 
system. Figure 2 shows a configuration file for the 4- 
node replication system shown in Figure 1; it is a real 
configuration that we used to check PACIFICA. Each line 
in the configuration tells MODIST how to start a process 
in the target system. A typical configuration consists 
of 2 to 10 processes. The “inject failure” flag is useful 
when users do not want to check failures for a process. 
For example, client.exe is an internal test program that
does not handle any failures, so we turned off failure
checking for this process.


init_state = checkpoint(create_init_state());
q.enqueue(init_state, init_state.actions);

while(!q.empty()) {
    <state, action> = q.dequeue();
    try {
        next_state = checkpoint(action(restore(state)));
        global_assert(next_state); // check user-provided global assertions
        if (next_state has never been seen before)
            q.enqueue(next_state, next_state.actions);
    } catch (Error e) {
        // save trace and report error
    }
}


Figure 3: Model checking pseudo-code.


With a configuration file, users can readily start check- 
ing their systems by running modist <config>. 
MODIST then instruments the executables referred to in 
the configuration file to interpose between the applica- 
tion and the operating system, and starts its model check- 
ing loop to explore the possible states and actions in the 
target system: a state is an instantaneous snapshot of the
target system, while an action can be to resume a paused 
WinAPI function via the interposition layer, to inject a 
failure via the failure simulator, or to fire a timer via the 
virtual clock manager. 


Figure 3 shows the pseudo-code of MODIST’s model checking
loop. MODIST first spawns the processes specified in the
configuration to create an initial state, and adds all
(initial state, action) pairs to a state queue, where action
is an action that the target system can do in the initial
state. Next, MODIST takes a (state, action) pair off the
state queue, restores the system to state, and performs
action. If the action generates an error, MODIST will save a
trace and report the error. Otherwise, MODIST invokes the
user-provided global assertions on the resultant global
state. MODIST further adds new state/action pairs to the
state queue based on one of MODIST’s search strategies (cf.
§3.6 for details). Then, it takes off another (state,
action) pair and repeats.


To implement the above process, MODIST needs to 
checkpoint and restore states. It uses a stateless ap- 
proach [12]: it checkpoints a state by remembering the 
actions that created the state and restores it by redoing all 
the actions. Compared to a stateful approach that
checkpoints a state by saving all the relevant memory bits,
a stateless approach requires few modifications to the
target system, as previous work has shown [12, 19, 28, 39].
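The checking loop with stateless checkpointing can be
sketched in a few lines of Python. This is an illustrative
sketch, not MODIST’s code: model_check and its four callback
parameters are hypothetical names, states are assumed to be
hashable, and the search order is plain breadth-first rather
than one of MODIST’s strategies. A checkpoint is simply the
trace of actions that produced a state, and restoring means
replaying that trace from a fresh initial state.

```python
from collections import deque


def model_check(initial, enabled_actions, apply, check):
    """Explore (trace, action) pairs; return a list of (trace, error message)."""

    def restore(trace):
        state = initial()              # restart the target from scratch
        for action in trace:           # redo every action in order
            state = apply(state, action)
        return state

    errors, seen = [], set()
    queue = deque(((), a) for a in enabled_actions(initial()))
    while queue:
        trace, action = queue.popleft()
        new_trace = trace + (action,)
        try:
            state = apply(restore(trace), action)
            check(state)               # user-provided global assertions
        except AssertionError as err:
            errors.append((new_trace, str(err)))   # save trace, report error
            continue
        if state not in seen:          # expand only states never seen before
            seen.add(state)
            queue.extend((new_trace, a) for a in enabled_actions(state))
    return errors


# Toy target (an assumption for illustration): a counter that must not reach 3.
def toy_check(state):
    if state == 3:
        raise AssertionError("counter reached 3")


toy_errors = model_check(
    initial=lambda: 0,
    enabled_actions=lambda s: ["inc"] if s < 3 else [],
    apply=lambda s, a: s + 1,
    check=toy_check,
)
```

The saved trace doubles as a reproduction recipe: replaying
it from the initial state deterministically leads back to
the failing state, which is what makes the stateless
approach attractive despite the cost of re-execution.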


2.3 Checks


The checks that MODIST performs include generic 
checks that require no user intervention as well as user- 
written system-specific checks. 

Currently, MODIST detects two classes of generic er- 
rors. The first is “fail-stop” errors, which manifest them- 
selves when the target system unexpectedly crashes in 
the absence of an injected crash from MODIST. These 
crashes can be segmentation faults due to memory er- 
rors or program aborts because MODIST has brought the 
target system into an erroneous state. MODIST detects 
these unexpected crashes by catching the corresponding 
signals. The second is “divergence” errors [12], which 
manifest themselves when the target system deadlocks 
or goes into an infinite loop. MODIST catches these er- 
rors using timeouts. When MODIST schedules one of the 
actions of the target system, it waits for a user-specified 
timeout interval (10 seconds by default) until the target 
system gets back to it; otherwise, MODIST will flag a 
divergence error. 

Because MODIST checks the target system by execut- 
ing it, MODIST can easily check the effects of real ex- 
ecutions and find errors. Thus, we can always combine 
MODIST with other dynamic error detection tools (e.g., 
Purify [16] and Valgrind [29]) to check more generic 
properties; we leave these checks for future work. 

In addition to generic checks, MODIST can perform 
system-specific checks via user-provided assertions,
including local assertions (via the assert() statements)
inserted into the target system and global assertions that 
run in the centralized model checking engine. Given 
these assertions, MODIST will amplify them by driving 
the target code into many possible states where these as- 
sertions may fail. In general, the more assertions users 
add, the more effective MODIST will be. 


2.4 Advanced User Interface 


As with most other automatic error detection tools, the 
more system-specific knowledge MODIST has, the more 
effective it will be. For users who want to check their 
system more thoroughly, MODIST provides the follow- 
ing methods for incorporating domain knowledge. 

Users can add more program assertions in the code for 
a more thorough check. In addition to these local asser- 
tions, users can enrich the set of checks by specifying 
global assertions in MODIST. These assertions check 
distributed properties on any consistent global snapshot. 

Users can make MODIST more effective by reducing 
their system’s state space. A simple trick is to bound the 
number of failures MODIST injects per execution. Our 
previous work [38, 39] showed that tricky bugs are often 
caused by a small number of failures at critical moments. 
Obviously, without bounds on the number of failures, a 
distributed system may keep failing without making any 
progress. In addition, developers tend to find bugs
triggered by convoluted failures uninteresting [38].

Users can provide hints to let MODIST focus on the 
states (among an infinite number of states) that users con- 
sider most interesting. Users can do so in two ways: (1) 
extend one of MODIST’s search algorithms through the 
well-defined state queue interface, and (2) construct a 
test case to test some unusual parts of the state space. 


3 Implementation 


We implemented MODIST on Windows by intercepting 
calls to WinAPI [36], the Windows Application Pro- 
gramming Interface. We chose WinAPI because it is 
the predominant programming interface used by almost 
all Windows applications and libraries, including the de- 
fault POSIX implementation on Windows. While we 
built MODIST on Windows, we expect that porting to 
other operating systems, such as Linux, BSD, and So- 
laris, should be easy because WinAPI is more compli- 
cated than the POSIX API provided by most other oper- 
ating systems. For example, WinAPI has several times 
as many functions as POSIX. Moreover, many WinAPI 
functions operate in both synchronous and asynchronous
mode, and the completion notifications of asynchronous IO
(AIO) may be delivered through several mechanisms, such as
events, select, or IO completion ports [36].

When we implemented MODIST we tried to adhere to 
the following two goals: 

1. Consistent and deterministic execution. The ex- 
ecutions MODIST explores and the failures it in- 
jects should be consistent and deterministic to 
avoid difficult-to-diagnose false positives and non- 
deterministic errors. 

2. Tailor for distributed systems. We explicitly designed 
MODIST to check distributed systems. Having this 
goal in mind, we customized our implementation for 
distributed systems and avoided being overly general. 

These goals are reflected in many places in our
implementation. In the rest of this section, we describe
MODIST’s implementation in detail, highlighting the
decisions entailed by these goals.


3.1 Interposition 


MODIST’s interposition layer transparently intercepts the
WinAPI functions in the target system and allows MODIST’s
backend to control it deterministically. There are two main
issues regarding interposition. First, interposition
complexity: since the interposition layer runs inside the
address space of the target system, it should be as simple
as possible to avoid perturbing the target system, or
introducing inconsistent or non-deterministic executions.
Second, IO abstraction: as previously mentioned, WinAPI is a
wide interface with rich semantics; Windows networking IO is
particularly complex. To avoid excessive complexity in
MODIST’s backend, the interposition layer should abstract
out the semantics irrelevant to checking and abstract the
WinAPI networking interface to a simpler form.


Category      # of functions   # of LOC
Network       28               1816
Time          7                161
File System   9                640
Mem           5                126
Thread        33               1433
Shared        -                1290
Total         82               5466


Table 1: Interposition complexity. This table shows the
lines of code for WinAPI wrappers, broken down by
categories. The “Shared” row refers to the code shared among
all API categories. Most wrappers are fairly small (67 lines
on average).


Interposition complexity. To reduce the interposition 
complexity, we implemented the interposition layer us- 
ing the binary instrumentation toolkit from our previous 
work [25]. This toolkit takes a list of annotated WinAPI 
functions we want to hook and automatically generates 
much of the wrapper code for interposition. Under the 
hood, it intercepts calls to dynamically linked libraries 
by overwriting the function addresses in relocation tables 
(import tables in Windows terminology). 

Since we check distributed systems, we only need to
intercept WinAPIs relevant to these systems. Table 1 shows
the categories of WinAPIs we currently hook: (1) networking
APIs, such as WSARecv() (receiving a message), for exploring
network conditions; (2) time APIs, such as GetSystemTime(),
for discovering timers; (3) file system APIs, such as
WriteFile() and FlushFileBuffers(), for injecting disk
failures and simulating crashes; (4) memory APIs, such as
malloc(), for injecting memory failures; and (5) thread
APIs, such as CreateThread() and SetEvent(), for scheduling
threads.

Most WinAPI wrappers are simple: they notify 
MODIST’s backend about the WinAPI calls using an 
RPC call, wait for the reply from the backend, and, 
upon receiving the reply, they either call the underlying 
WinAPIs or inject failures. Table 1 shows the total lines 
of code in all manually-written wrappers. Each wrapper 
on average consists of only 67 lines of code. 
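The shape of such a wrapper can be shown with a language-
neutral sketch in Python. It is only an illustration of the
notify, wait, then call-or-fail pattern described above;
make_wrapper, backend_decide, and the "run"/"fail" verdicts
are hypothetical names, and a plain function call stands in
for the RPC to the backend.

```python
def make_wrapper(real_call, backend_decide, name):
    """Wrap real_call so that a backend decides whether it runs or fails."""

    def wrapper(*args, **kwargs):
        # Report the intercepted call and block until the backend answers.
        verdict = backend_decide(name)
        if verdict == "fail":
            # The backend asked for a failure: do not call the real function.
            raise OSError(f"{name}: injected failure")
        # The backend resumed the paused action: forward to the real call.
        return real_call(*args, **kwargs)

    return wrapper
```

Keeping the wrapper this thin leaves all scheduling state in
the backend, in line with the goal of a simple, mostly
stateless interposition layer.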


IO abstraction. Controlling the Windows networking IO
interface is complex for three reasons: (1) there are many
networking functions; (2) these functions heavily use AIO,
whose executions are hidden inside the kernel and not
exposed to MODIST; and (3) these functions may produce
non-deterministic results due to failures in the network. We
addressed these issues using three methods: (1) abstracting
similar network operations into one generic operation to
narrow the networking IO interface, (2) exposing AIO to
MODIST by running it synchronously in a proxy thread, and
(3) carefully placing error injection points to avoid
non-determinism.

To demonstrate our methods, we show in Figure 4 the 
wrapper for WSARecv(), a WinAPI function to syn- 
chronously or asynchronously receive data from a socket. 
For simplicity, we omit error-handling code and assume 
AIO completion is delivered using events only (events are
similar to binary semaphores).

Our wrapper first checks whether the network connection
represented by the socket argument s is already broken by
MODIST (lines 5-8). If so, it simply returns an error to
avoid inconsistently returning success on a broken socket.
It then handles AIO (lines 9-24) by creating a generic
network IO structure net_io (lines 10-14), hijacking the
application’s IO completion event (lines 16-18), spawning a
proxy thread (line 21), and issuing the AIO to the OS (line
23). The proxy thread will invoke function
mc::net_io::run() (lines 29-55). This function first
notifies MODIST about the IO (line 34). Upon MODIST’s reply,
it either injects a failure (lines 36-40), or waits for the
OS to complete the IO (lines 40-51). Function run() then
reports the IO result to MODIST, which in this example is
the length of the data received (lines 47-50). Finally, it
calls the wrapper for SetEvent() to wake up any real threads
in the target system that are waiting for the IO to
complete.

This wrapper example demonstrates the abstraction we use
between MODIST’s interposition frontend and the backend. A
network IO is split into an io_issue and an io_result RPC.
The first RPC, io_issue, expresses the IO intent of the
target system to MODIST before it proceeds to a potentially
blocking IO, letting MODIST avoid scheduling a disabled
(i.e., blocked) IO. Its second purpose is to serve as a
failure injection point. The second RPC, io_result, lets
MODIST update the control information it tracks.

These RPC methods take the message sizes and the 
network connections as arguments, but not the spe- 
cific message buffers or sockets, which may change 
across different executions. This approach ensures that 
MODIST’s backend sees the same RPC calls when it re- 
plays the actions to recreate the same state as when it 
initially created the state. If MODIST detects a
non-deterministic replay (e.g., a WSARecv() receives fewer
bytes than expected), it will retry the IO by default.

There are two additional nice features about our IO
abstraction: (1) it allows wrapper code sharing and
therefore reduces the interposition complexity (Table 1,
“Shared” row), and (2) it abstracts away the OS-specific
features and enables the backend to be OS-agnostic.


 1: // the OS uses lpOverlap to deliver IO completion
 2: int mc_WSARecv(SOCKET s, LPWSABUF buf, DWORD nbuf,
 3:                ..., LPWSAOVERLAPPED lpOverlap, ...) {
 4:   // check if MODIST has broken this connection
 5:   if(mc_socket_is_broken(s)) {
 6:     WSASetLastError(WSAENETRESET);
 7:     return SOCKET_ERROR;
 8:   }
 9:   if(lpOverlap) { // Asynchronous mode
10:     mc::net_io *io = ...;
11:     io->orig_lpOverlap = lpOverlap;
12:     io->op = mc::RECV_MESSAGE; // set IO type
13:     io->connection = ...; // Identify connection using
14:                           // ...
15:
16:     // Hijack application's IO completion notification event
17:     io->orig_event = lpOverlap->hEvent;
18:     lpOverlap->hEvent = io->proxy_event;
19:
20:     // Start a proxy thread ...
21:     io->start_proxy_thread();
22:     // Issue asynchronous receive to the OS
23:     return ::WSARecv(s, buf, nbuf, ..., io->proxy_lpOverlap, ...);
24:   }
25:
26:   ...
27: }
28: // mc::net_io code is shared among all networking IO
29: void mc::net_io::run() { // called by proxy thread
30:   mc::rpc_client *rpc = mc::current_thread_rpc_client();
31:
32:   // This RPC blocks this thread. It returns only when MODIST
33:   // wants to (1) inject a failure, or (2) complete the IO
34:   int ret = rpc->io_issue(this->op, this->connection);
35:
36:   if(ret == mc::FAILURE) {
37:     // MODIST wants to inject a failure
38:     this->orig_lpOverlap->Internal // Fake an IO failure
39:         = STATUS_CONNECTION_RESET;
40:     ...
41:   } else { // MODIST wants to complete this IO
42:     // Wait for the OS to actually complete the IO, because the
43:     // data to receive may still be in the real network.
44:     // This wait will not block forever, since MODIST's
45:     // dependency tracker knows there are bytes to receive
46:     ::WaitForSingleObject(this->proxy_event, INFINITE);
47:
48:     // Report the bytes actually sent or received, so MODIST's
49:     // dep. tracker knows how many bytes are in the network.
50:     int msg_size = this->orig_lpOverlap->InternalHigh;
51:     rpc->io_result(this->op, this->connection, msg_size);
52:   }
53:   // deliver IO notification to application. mc_SetEvent is
54:   // a wrapper to WinAPI SetEvent
55:   mc_SetEvent(this->orig_event);
56: }


Figure 4: Simplified WSARecv() wrapper.


3.2 Dependency Tracking


MODIST’s dependency tracker monitors how actions might
affect each other. The notion of dependency is from [12]:
two actions are dependent if one can enable or disable the
other or if executing them in a different order leads to a
different state. MODIST uses these dependencies to avoid
false deadlocks (described below), to simulate failures
(§3.3), and to reduce state space (§3.6).

To avoid false deadlocks, MODIST needs to compute the set of
enabled actions that will not block in the OS. For
determinism, MODIST schedules one action at a time and
pauses all other actions (cf. §2.2). If MODIST incorrectly
schedules a disabled action (such as a blocking WSARecv()),
it will deadlock the target system because the scheduled
action is blocked in the OS while all other actions are
paused by MODIST.

Since the dependency tracker tries to infer whether the OS
scheduler would block a thread in a WinAPI call (recall that
the interposition layer exposes AIOs as threads), it
unsurprisingly resembles an OS scheduler and replicates a
small amount of the control data in the OS and the network.
To illustrate how it works, consider the WSARecv() wrapper
in Figure 4. The dependency tracker will track precisely how
many bytes are sent and received for each network connection
using the io_result RPC (line 50). If a thread tries to
receive a message (line 34) when none is available, the
dependency tracker will mark this thread as disabled and
place it on the wait queue of the connection. Later, when a
WSASend() occurs at the other end of the connection, the
dependency tracker will remove this thread from the wait
queue and mark it as enabled. When MODIST schedules this
thread by replying to its RPC io_issue, the thread will not
block at line 45 because there is data to receive. In
addition to network control data, the dependency tracker
also tracks threads, locks, and semaphores.


3.3 Failure Simulation


When requested by the model checking engine, MODIST’s
failure simulator injects five categories of failures: API
failures (e.g., WriteFile() returns “disk error”), message
reordering,* message loss, network partitions, and machine
crashes. Simulating API failures is the easiest: MODIST
simply tells the interposition layer to return an error
code. Reordering messages is also easy since the model
checking engine already explores different orders of
actions. To simulate different crash scenarios, we used
techniques from our previous work [38, 39] to permute the
disk writes that a system issues.

*Message reordering is not a failure, but since it is often
caused by abnormal network delay, for convenience we
consider it as a failure.

Simulating network failures is more complicated due to the
consistency and determinism requirement. We first tried a
naive approach: simply closing sockets to simulate
connection failures. This approach did not work well because
we frequently experienced inconsistent failures:
the “macro” failures we want to inject (e.g., network
partition) map to not one but a set of “micro” failures we
can inject through the interposition layer (e.g., a failed
WSARecv()). For example, to break a TCP connection, we must
carefully fail all pending asynchronous IOs associated with
the connection at both endpoints. Otherwise, the target
system may see an inconsistent connection status and crash,
thus generating a false positive.

We also frequently experienced non-deterministic failures
because the OS detects failures using non-deterministic
timeouts. Consider the following actions:

1. Process P1 calls WSASend(P2, message).

2. Process P2 calls asynchronous WSARecv(P1).

3. MODIST breaks the connection between P1 and P2.

P2 may or may not receive the message, depending on when
P2’s OS times out the broken connection.

Our current approach ensures that failure simulation is
consistent and deterministic as follows. We know the exact
set of real or proxy threads that are paused by MODIST in
rpc->io_issue() (Figure 4, line 34). To simulate a network
failure, we inject failures to all these threads, and we do
so immediately to avoid any non-deterministic kernel
timeouts. Note that doing so in the example above will not
cause us to miss the scenario where P2 receives the message
before the connection breaks; MODIST will simply explore
this scenario in a different execution where it completes
P2’s asynchronous WSARecv() first (by replying to P2’s
io_issue() RPC), and then breaks the connection between P1
and P2.
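The “fail every paused IO at once” rule can be made concrete
with a short Python sketch. It is a simplified model built
for illustration, not MODIST’s backend: FailureSimulator,
io_issue, and break_connection are hypothetical names, and a
connection is represented by an arbitrary hashable key.

```python
class FailureSimulator:
    """Toy backend view of threads paused in io_issue, per connection."""

    def __init__(self):
        self.paused = {}      # thread id -> connection it is paused on
        self.broken = set()   # connections that have been broken

    def io_issue(self, thread, conn):
        # Any IO issued on an already broken connection fails right away,
        # so no thread ever sees success on a broken connection.
        if conn in self.broken:
            return "fail"
        self.paused[thread] = conn
        return "paused"

    def break_connection(self, conn):
        # Fail all IOs paused on this connection in a single step, at both
        # endpoints, without waiting for any kernel timeout.
        self.broken.add(conn)
        victims = sorted(t for t, c in self.paused.items() if c == conn)
        for t in victims:
            del self.paused[t]
        return victims
```

Because the set of victims is computed from the backend’s
own bookkeeping, the same break always fails the same
threads, which is what makes a replayed execution
deterministic in this model.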


3.4 Virtual Clock 


MODIST’s virtual clock manager injects timeouts when
requested by the model checking engine and provides a
consistent view of the clock to the target system. A side
benefit of the virtual clock is that the target system may
run faster because the virtual clock manager can fast
forward time. For example, when the target system calls
sleep(1000), the virtual clock manager can add 1000 to its
current virtual clock and let the target system wake up
immediately.
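A minimal Python sketch of this fast-forwarding behavior
follows. VirtualClock and its methods are hypothetical names
chosen for illustration, and time is counted in
milliseconds as in the sleep(1000) example.

```python
class VirtualClock:
    """Toy virtual clock: sleeping advances time without real waiting."""

    def __init__(self, start_ms=0):
        self.now_ms = start_ms

    def time(self):
        # Every process that asks this object sees the same current time.
        return self.now_ms

    def sleep(self, ms):
        # Fast-forward: add the requested delay and return immediately.
        self.now_ms += ms
```

Sharing one such clock among all processes of the target is
one simple way to give them a consistent view of time.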


Discovering Timeouts. To detect bugs in rarely tested
timeout handling code, we want to discover as many timers as
possible. This task is difficult because system code
extensively uses implicit timers, where the code first gets
the current time (e.g., by calling gettimeofday()), then
checks whether a timeout has occurred (e.g., using an
if-statement). Figure 5 shows a real example in Berkeley DB.

Since implicit timers do not use OS APIs to check 
timeouts, they are difficult to discover by a model 
checker. Previous work [19, 27, 38] requires users to 
manually annotate implicit timers. 


// db-4.7.25.NC/repmgr/repmgr_sel.c
int __repmgr_compute_timeout(ENV *env, timespec *timeout)
{
    db_timespec now, t;
    // Set t to the first due time.
    if (have_timeout) {
        __os_gettime(env, &now, 1); // Query current time.
        if (now >= t) // Timeout check, immediately follows the query.
            *timeout = 0; // Timeout occurs.
        else
            *timeout = t - now; // No timeout.
    }
    ...
}


Figure 5: An implicit timer in Berkeley DB (after macro
expansion and minor editing).


To discover implicit timers automatically, we devel- 
oped a static symbolic analysis technique. It is based on 
the following two observations: 


1. Programmers use time in simple ways. For ex- 
ample, they explicitly label time values (e.g., 
db_timespec in Figure 5), they do simple arith- 
metic on time values, and they generally do not cast 
time values to pointers and other unusual types. This 
observation implies that simple static analysis is suf- 
ficient to track how a time value flows. 

2. Programmers check timeouts soon after they query 
the current time. The intuition is that programmers 
want the current time to be “fresh” when they check 
timeouts. This observation implies that our analysis 
only needs to track a short flow of a time value (e.g., 
within three function calls) and may stop when the 
flow becomes long. 


We analyzed how time values are used in Berkeley DB 
version 4.7.25. We found that Berkeley DB mostly uses “+”
and “-”, and occasionally “*” and “/” (for conversions,
e.g., from seconds to milliseconds). In 12 out of 13
implicit timers, the time query and time check are within a
few lines.


Our analysis resembles symbolic execution [3, 4, 13, 31]. It
has three steps: (1) statically analyze the code of the
target system and find all system calls that return time
values; (2) track how the time values flow to variables; and
(3) upon a branch statement involving a tracked time value,
use a simple constraint solver to generate symbolic values
that exercise both outcomes of the branch. To show the idea,
we use a source code instrumentation example. In Figure 5,
our analysis can track how time flows from “__os_gettime” to
“if (now >= t)”, and replace the “__os_gettime” line in
Figure 5 with


mc::rpc_client *rpc = mc::current_thread_rpc_client();
now = rpc->gettime(/*timer=*/t);


This RPC call tells the virtual clock manager that a timer
fires at t; the virtual clock manager can then return a 
time value smaller than t for one execution, and greater 
than t for another execution, to explore both possible 
execution paths. We implemented our analysis using the 
Phoenix compiler framework [30]. 
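The effect of answering a time query on either side of the
timer can be sketched in Python. This is an illustration
only: gettime_choices is a hypothetical helper standing in
for the virtual clock manager’s decision, and
compute_timeout mirrors just the branch of Figure 5 with
integer time values.

```python
def gettime_choices(timer_due):
    """Two answers for a time query: one before and one after the timer."""
    return [timer_due - 1, timer_due + 1]


def compute_timeout(now, t):
    # Mirrors the branch in Figure 5: 0 means the timeout has occurred.
    return 0 if now >= t else t - now


# One execution per choice covers both outcomes of the timeout check.
outcomes = [compute_timeout(now, 100) for now in gettime_choices(100)]
```

Each choice corresponds to a separate checked execution, so
both the “no timeout” and the “timeout occurs” paths get
explored without the user annotating the timer by hand.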

Since our analysis is static, it avoids the runtime over- 
head of instrumenting each load and store for tracking 
symbolic values and thus is much simpler than dynamic 
symbolic execution tools [3, 4, 13, 31], which often take 
iterations to become stable [3, 4]. Note our analysis is 
unsound, as with other symbolic analysis tools, in that 
it may miss some timers and thus miss bugs. However, 
it will not introduce false positives because the virtual 
clock manager ensures the consistency of time. 


Ensuring Consistent Clock. A consistent clock is crucial to
avoid false positives. For example, the safety of the lease
mechanism [14] requires that the lessee times out before the
lessor; reversing the order may trigger “bugs” that never
occur in practice. We actually encountered a painful false
positive due to a violation of this safety requirement when
checking PACIFICA.

To maintain consistent time, the virtual clock manager sorts
all timers in the target system from earliest to latest
based on when these timers will fire. When the model
checking engine decides to fire a timer, it will
systematically choose one of several timers that fall in the
range [T, T + E], where T is the earliest timer and E is a
configurable clock error allowed by the target system. This
mechanism lets MODIST explore interesting timer behaviors
while not deviating too much from real timer-triggered
executions.
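The window rule can be written down directly. The following
Python sketch assumes timers are given as a mapping from a
name to a due time; fireable_timers and clock_error are
hypothetical names for the selection step and for E.

```python
def fireable_timers(timers, clock_error):
    """Return timer names due within [T, T + clock_error], earliest first.

    timers maps a timer name to its due time; T is the earliest due time.
    """
    if not timers:
        return []
    earliest = min(timers.values())
    window = [(due, name) for name, due in timers.items()
              if due <= earliest + clock_error]
    return [name for due, name in sorted(window)]
```

With clock_error set to 0 only the earliest timers are
candidates, so a lessor that is due later than its lessee
can never be chosen first; a larger clock_error widens the
set of orders the model checker may try.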


3.5 Global Assertion 


We have implemented global assertions by leveraging our
previous work, D³S [25]. D³S enables transparent predicate
checking of a running distributed system. It provides
a simple programming interface for developers to specify
global assertions, interposes both user-level functions
and OS system calls in the target system to expose its
runtime state as state tuples, and collects such tuples as
globally consistent snapshots for evaluating assertions.
To use D³S, developers need to specify the functions being
interposed, the state tuples being retrieved from function
parameters, and a sequential program that takes a
complete state snapshot as input to evaluate the predicate.
D³S compiles such assertions into a state-exposing
module, which is injected into all processes of the target
system, and a checking module, which contains the evaluation
programs and outputs checking results for every
constructed snapshot.

MODIST incorporates D³S to enable global assertions,
with two notable modifications. First, we simplify
D³S by letting each node transmit state tuples synchronously
to MODIST’s checking process, which verifies
assertions immediately. Previously, because nodes
may transmit state tuples concurrently, D³S had to buffer
each received tuple, before checking assertions, until all
tuples that causally precede it had been received.
Since MODIST runs one action at a time, it no
longer needs to buffer tuples. Second, while D³S uses a
Lamport clock [23] to totally order state tuples into snapshots,
MODIST uses a vector clock [26] to check more
global snapshots.
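
The reason a vector clock exposes more snapshots can be seen in a small sketch: a total order fixes one interleaving, whereas vector clocks identify concurrent events, and each ordering of concurrent events corresponds to a different consistent snapshot. The code below is a generic vector clock comparison, not MODIST’s implementation; the type and function names are chosen for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using VectorClock = std::vector<unsigned>;

// True if a happened before b: a <= b in every component and a != b.
bool happens_before(const VectorClock& a, const VectorClock& b) {
    assert(a.size() == b.size());
    bool strictly_less = false;
    for (std::size_t i = 0; i < a.size(); ++i) {
        if (a[i] > b[i]) return false;
        if (a[i] < b[i]) strictly_less = true;
    }
    return strictly_less;
}

// Two distinct events are concurrent when neither happened before the other.
bool concurrent(const VectorClock& a, const VectorClock& b) {
    return a != b && !happens_before(a, b) && !happens_before(b, a);
}
```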


3.6 State Space Exploration 


MODIST maintains a queue of the state/action pairs to 
be explored. Due to the complexity of a distributed system,
it is often infeasible for MODIST to exhaust the state
space. Thus, it is key to decide which state/action pairs 
to add to the queue and the order in which they are ex- 
plored. 

MODIST tags each action with a vector clock and im- 
plements a customizable modular framework for explor- 
ing the state space so different reduction techniques and 
heuristics can be incorporated. This is largely inspired by 
our observation that the effectiveness of various strate- 
gies and heuristics is often application-dependent. 

The basic state exploration process is simple:
MODIST takes the first state/action pair (s,a) from the 
queue, steers the system execution to state s if that is not 
the current state, applies the action a, reaches a new state 
s’, and examines the new resulting state for errors. It then 
calls a customizable function explore, which takes the 
entire path from the initial state to s and then s’, where 
each state is tagged with its vector clock and each state 
transition is tagged with the action corresponding to the 
transition. For s’, all enabled actions are provided to the 
function. The function then produces a list of state/action 
pairs and indicates whether the list should be added to the 
front of the queue or the back. MODIST then inserts the 
list into the queue and repeats the steps. 

MODIST has a natural bias towards exploring (s,a)
pairs where s is the state MODIST is in. This default 
strategy will save the cost of replaying the trace to reach 
the state in the selected state/action pair. 

Now we show how various state exploration strate- 
gies and heuristics can be implemented in the MODIST 
framework. 

Random. Random exploration with a bounded maxi- 
mum path length explores a random path up to a bounded 
path length and then starts from the initial state for another
random path. The explore function works as follows:
if the current path has not exceeded the bound,
the function randomly picks an enabled action a’ at
the new state s’ and has (s’,a’) inserted at the end of
the queue (note that the queue is empty). If the current
path has reached the bound, the function randomly
chooses an enabled action a_0 in the initial state s_0, and has
(s_0,a_0) inserted at the end of the queue.

DFS and BFS. For Depth First Search (DFS) and 
Breadth First Search (BFS), the explore function sim- 
ply inserts (s’,a’) for every enabled action a’ in state s’. 
For DFS, the new list is inserted at the front of the queue, 
while for BFS at the back. Clearly, DFS is more attrac- 
tive since MODIST does not have to replay traces often 
to recreate states. 
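
The exploration loop and the DFS and BFS variants of explore can be sketched as follows. This is a simplified model under stated assumptions: states and actions are plain integers, the explore function sees only the new state and its enabled actions instead of the whole path, and steering the system back to a state is reduced to calling a pure transition function. All names (Model, ExploreResult, run_checker, enqueue_all) are hypothetical.

```cpp
#include <cassert>
#include <deque>
#include <functional>
#include <utility>
#include <vector>

using State = int;   // stand-in for a real system state
using Action = int;  // stand-in for a message delivery, timer firing, or failure
using Pair = std::pair<State, Action>;

struct ExploreResult {
    std::vector<Pair> pairs;  // state/action pairs to explore later
    bool at_front;            // insert at the front (true) or back (false) of the queue
};

// Simplified explore signature: the new state and the actions enabled in it.
using ExploreFn = std::function<ExploreResult(State, const std::vector<Action>&)>;

struct Model {
    std::function<State(State, Action)> apply;          // performs one transition
    std::function<std::vector<Action>(State)> enabled;  // actions enabled in a state
};

// DFS and BFS enqueue every enabled action; they differ only in where the list goes.
ExploreFn enqueue_all(bool at_front) {
    return [at_front](State s, const std::vector<Action>& enabled) {
        ExploreResult r{{}, at_front};
        for (Action a : enabled) r.pairs.push_back({s, a});
        return r;
    };
}

// Takes pairs from the queue, applies them, and lets explore decide what to add.
// Returns the states reached, in order; a real checker would examine each for errors.
std::vector<State> run_checker(const Model& m, State initial, const ExploreFn& explore,
                               int max_steps) {
    assert(max_steps >= 0);
    std::deque<Pair> queue;
    for (Action a : m.enabled(initial)) queue.push_back({initial, a});
    std::vector<State> reached;
    while (!queue.empty() && max_steps-- > 0) {
        const Pair next = queue.front();
        queue.pop_front();
        const State s_new = m.apply(next.first, next.second);
        reached.push_back(s_new);
        const ExploreResult r = explore(s_new, m.enabled(s_new));
        queue.insert(r.at_front ? queue.begin() : queue.end(), r.pairs.begin(), r.pairs.end());
    }
    return reached;
}
```

With enqueue_all(true) the loop always continues from the state it just reached, which is why DFS rarely needs to replay a trace; with enqueue_all(false) consecutive pairs usually belong to different states.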

DPOR. For dynamic partial order reduction (DPOR), the
explore function works as follows. Let a be the last
action causing the transition from s to s’. The function
looks at every state s_p before s on the path and the action
a_p taken at that state. If a is enabled at s_p (i.e., if s
and s_p are concurrent as judged by the vector clocks) and a
does not commute with a_p (i.e., the different orders of the
two actions could lead to different executions), we record
(s_p,a) in the list of pairs to explore. Once all states are
examined, the function returns the list and has MODIST
insert the list in the queue.
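
A minimal sketch of this DPOR explore function is shown below. It assumes that each step on the path carries the vector clock of its action, and it uses a deliberately simple commutativity rule (two actions commute when they modify different nodes). Both the Step layout and that rule are assumptions for illustration; a real checker would use a more precise dependency relation.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

using VectorClock = std::vector<unsigned>;

struct Step {
    int state;          // id of the state the action was taken in (s_p)
    int action;         // id of the action taken there (a_p)
    int target_node;    // node whose local state the action modifies
    VectorClock clock;  // vector clock tagged on the action
};

bool happens_before(const VectorClock& a, const VectorClock& b) {
    assert(a.size() == b.size());
    bool strictly_less = false;
    for (std::size_t i = 0; i < a.size(); ++i) {
        if (a[i] > b[i]) return false;
        if (a[i] < b[i]) strictly_less = true;
    }
    return strictly_less;
}

bool concurrent(const VectorClock& a, const VectorClock& b) {
    return a != b && !happens_before(a, b) && !happens_before(b, a);
}

// Assumed commutativity rule: actions on different nodes commute.
bool commute(const Step& x, const Step& y) { return x.target_node != y.target_node; }

// The last element of path is the step just executed (action a). Returns the
// (s_p, a) pairs where a is concurrent with, and does not commute with, a_p.
std::vector<std::pair<int, int>> dpor_explore(const std::vector<Step>& path) {
    std::vector<std::pair<int, int>> to_explore;
    if (path.empty()) return to_explore;
    const Step& last = path.back();
    for (std::size_t p = 0; p + 1 < path.size(); ++p) {
        const Step& earlier = path[p];
        if (concurrent(earlier.clock, last.clock) && !commute(earlier, last))
            to_explore.push_back({earlier.state, last.action});
    }
    return to_explore;
}
```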

By specifying how the list is inserted, the function 
could choose to use DFS or BFS on top of DPOR. Also, 
by ordering the pairs in the list differently, MODIST will 
be instructed to explore the newly added branches in dif- 
ferent orders (e.g., top-down or bottom-up). The default 
is DFS again to avoid the cost of recreating states. We 
further introduce Bounded DPOR to refer to the varia- 
tion of DPOR with bounds on DFS for a more balanced 
state-space exploration. 

The explore function can be constructed to favor cer- 
tain actions (e.g., crash events) over others, to bound the 
exploration in various ways (e.g., the path length and the 
number of certain actions on the path), and to focus on a 
subset of possible actions. 


4 Evaluation 


We have applied MODIST to three distributed systems: 
(1) Berkeley DB, a widely used open-source database 
(a version with replication); (2) MPS, a closed-source
Paxos [22] implementation that was built by a Microsoft product
team and has been deployed in commercial data centers
for more than two years; and (3) PACIFICA, a mature 
implementation of a primary-backup replication proto- 
col we developed. We picked Berkeley DB and MPS 
because of their wide deployment and importance and 
PACIFICA because it provides an interesting case study 
where the developers apply model checking to their own 
systems. 

Table 2 summarizes the errors we found, all of which
are previously unknown bugs. We found a total of 35
errors, 10 of which are protocol-level bugs that occur
only under rare interleavings of messages and crashes;
these bugs reflect flaws in the underlying communication
protocols of the systems. Implementation bugs are
those that can be caused by injecting API failures. All
MPS and PACIFICA bugs were confirmed by the developers.
Three out of seven Berkeley DB bugs, including
one protocol-level bug, were confirmed by Berkeley
DB developers; we are having the rest confirmed. These
unconfirmed bugs are likely real bugs because we can
reproduce them without MODIST by manually tweaking
the executions and killing processes according to the
traces from MODIST.


System        KLOC   Protocol  Impl.  Total
Berkeley DB   172.1  2         5      7
MPS           53.5   2         11     13
PACIFICA      12     6         9      15
Total         237.6  10        25     35


Table 2: Summary of errors found. The KLOC (thousand
lines of code) column shows the sizes of the systems
we checked. We separate protocol-level bugs (Protocol)
and implementation-level bugs (Impl.), in addition to reporting
the total (Total). 31 of the 35 bugs have been
confirmed by the developers.

While other tools (e.g., a static analyzer) can also find 
implementation bugs, MODIST has the advantage of not 
generating false positives. In addition, it can expose the 
effects of these bugs, helping prioritize fixing. 

In the rest of this section, we describe our error detec- 
tion methodology, the bugs we found, MODIST’s cov- 
erage results and runtime overhead, and the lessons we 
have learned. 


4.1 Experimental Methodology 


Test driver. Model checking is most effective at check- 
ing complicated interactions between a small number of 
objects. Thus, in all tests we run, we use several pro- 
cesses servicing a bounded number of requests. Since 
the systems we check came with test cases, we simply 
use them with minor modifications. 

Global assertions. By default, MODIST checks fail- 
stop errors. To check the correctness properties of a dis- 
tributed system, MODIST supports user supplied global 
assertions (§2). For the replication systems we checked, 
we added two types of assertions. The first type was
global predicates for the safety properties, for example,
that all replicas agree on the same sequence of commands.
The second type of predicate checks for liveness. True
liveness conditions cannot be checked by execution monitoring,
so we instead approximate them by checking for
progress in the system: we expect the target system to 
make progress in the absence of failures. In the end, we 
did not find any bug that violated the safety properties in 
any of the systems, probably reflecting the relative ma- 
turity of these systems. However, we did find bugs that 
violated liveness global assertions in every system. 
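
To make the two kinds of assertion concrete, the sketch below shows one plausible shape for each. It assumes that a snapshot exposes each replica’s executed commands as a list of integer ids and that agreement is checked on the common prefix, so that a lagging replica is not flagged; the progress check assumes a hypothetical "patience" threshold counted in snapshots. Neither function is taken from MODIST or D³S.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

using Log = std::vector<int>;  // command ids in execution order (hypothetical encoding)

// Safety: any two replicas agree on the commands they have both executed.
bool logs_agree(const std::vector<Log>& logs) {
    for (std::size_t i = 0; i < logs.size(); ++i) {
        for (std::size_t j = i + 1; j < logs.size(); ++j) {
            const std::size_t common = std::min(logs[i].size(), logs[j].size());
            const auto end = logs[i].begin() + static_cast<std::ptrdiff_t>(common);
            if (!std::equal(logs[i].begin(), end, logs[j].begin())) return false;
        }
    }
    return true;
}

// Approximate liveness: with no failures injected, the committed-command count
// must grow before `patience` consecutive snapshots pass without change.
bool makes_progress(const std::vector<std::size_t>& committed_per_snapshot,
                    std::size_t patience) {
    assert(patience > 0);
    std::size_t stalled = 0;
    for (std::size_t i = 1; i < committed_per_snapshot.size(); ++i) {
        stalled = committed_per_snapshot[i] > committed_per_snapshot[i - 1] ? 0 : stalled + 1;
        if (stalled >= patience) return false;
    }
    return true;
}
```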
Search strategy. MODIST has a set of built-in search
strategies; no single strategy works best. We have
combined these strategies in our experiments for discov- 
ering bugs effectively. For example, we can first perform 
random executions (Random) on the system and inject 
the API failures randomly to get the shallow implemen- 
tation bugs. We can then use the DPOR strategy with 
randomly chosen initial paths to explore message orders 
systematically. We can further add crash and recovery 
events on top of the message interleaving, starting from 
a single crash and gradually increasing the number of 
crashes, to exercise the system’s handling of crash and 
recovery. We can run these experiments concurrently and 
fine-tune the strategies. 

Terminology. Distributed systems use different termi- 
nologies to describe the roles the nodes play in the sys- 
tems. In this paper, we will use primary and secondary 
to distinguish the replicas in the systems. They are called 
master and client respectively in Berkeley DB docu- 
ments. In the Paxos literature, a primary is also called 
a leader. 


4.2 Berkeley DB: a Replicated Database 


Berkeley DB is a widely used open source transactional 
storage engine. Its latest version supports replication for 
applications that must be highly available. In a Berkeley 
DB replication group, the primary supports both reads 
and writes while secondaries support reads only. New 
replicas can join the replication group at any time. 

We checked the latest Berkeley DB production re- 
lease: 4.7.25.NC. We use ex_rep_mgr, an example ap- 
plication that comes with Berkeley DB as the test driver. 
This application manages its data using the Berkeley DB 
Replication Manager. Our test setup has 3 to 5 pro- 
cesses. They first run an election. Once the election com- 
pletes, the elected primary inserts data into the replicated 
database, reads it back, and verifies that it matches the 
data inserted. 


Results and Discussions. We found seven bugs in 
Berkeley DB: four were triggered by injecting API fail- 
ures, one was a dangling pointer error triggered by the 
primary waiting for multiple ACK messages simultane- 
ously from the secondaries, and the remaining two were 
protocol-level bugs, which we describe below. 

The first protocol-level bug causes a replica to crash 
due to an “unexpected” message. The timing dia- 
gram of this bug is depicted in Figure 6. Replica C 
is the original primary. Suppose a new election is 
launched, resulting in replica A becoming the new pri- 
mary. Replica A will broadcast a REP_NEWMASTER 
message, which means “I am the new primary.” After 
replica B receives this message, it tries to synchronize 
with the new primary and sends A a REP_UPDATA_REQ 
message to get the up-to-date data. Meanwhile, C
processes REP_NEWMASTER by first broadcasting a
REP_DUPMASTER message, which means “duplicate
primary detected,” and then degrading itself to a secondary.
Broadcasting a REP_DUPMASTER message
is necessary to ensure that all other replicas know
that C is not primary anymore. When A processes
REP_DUPMASTER, it has to give up its primary role
because it cannot make sure that it is the latest primary.
Soon A receives the delayed but not-outdated
REP_UPDATA_REQ message from B. Replica A panics
at once, because such a message should only be received
by a primary. Such panics occur whenever a delayed
REP_UPDATA_REQ message arrives at a recently
degraded primary.


Figure 6: Timing Diagram of Message Exchanges in a
Berkeley DB Replication Bug.

The second protocol-level bug is more severe: it causes
permanent failures in leader election due to a primary 
crash when all secondaries believe they cannot be pri- 
maries. Suppose replica A is the original primary and is 
synchronizing data with secondaries B and C. Normally 
synchronization works as follows. A sends a REP_PAGE 
message with the modified database page to B and C. 
Upon receipt of this message, B and C transition to the log
recovery state by setting the REP_F_RECOVER_LOG flag.
A then sends a REP_LOG message with the updated log
records. However, if A crashes before it sends REP_LOG,
B and C will never be able to elect a new primary be- 
cause, in Berkeley DB’s replication protocol, a replica in 
log recovery is not allowed to be a primary. 
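
The deadlock can be reproduced in a toy model. The sketch below is not Berkeley DB code: the Replica struct and both functions are hypothetical, and the only rule taken from the text is that a replica in log recovery may not become primary.

```cpp
#include <cassert>
#include <vector>

struct Replica {
    bool alive;
    bool in_log_recovery;  // models the REP_F_RECOVER_LOG flag
};

// An election can succeed only if some live replica is outside log recovery.
bool election_can_succeed(const std::vector<Replica>& group) {
    assert(!group.empty());
    for (const Replica& r : group)
        if (r.alive && !r.in_log_recovery) return true;
    return false;
}

// Replays the scenario: index 0 is primary A, indices 1 and 2 are B and C.
bool stuck_after_primary_crash() {
    std::vector<Replica> group = {{true, false}, {true, false}, {true, false}};
    group[1].in_log_recovery = true;      // B receives REP_PAGE
    group[2].in_log_recovery = true;      // C receives REP_PAGE
    assert(election_can_succeed(group));  // A itself is still eligible
    group[0].alive = false;               // A crashes before sending REP_LOG
    return !election_can_succeed(group);
}
```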


4.3 MPS: Replicated State Machine Library 


MPS is a practical implementation of a replicated state 
machine library. The library has been used for over two 
years in production clusters of more than 100K machines 
for maintaining important system metadata consistently 
and reliably. It consists of 8.5K lines of C++ code for 
the communication protocol, and 45K for utilities such 
as networking and storage. 

At the core of MPS is a distributed Paxos protocol for 
consensus [22]. The protocol is executed on a set of ma- 
chines called replicas. The goal of the protocol is to have 
replicas agree on a sequence of deterministic commands 
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Figure 7: The Timing Diagram of Message Exchange in 
MPS Bug 1. 


and execute the commands in the same sequence order. 
Because all replicas start with the same initial state and 
execute the same sequence of commands, consistency 
among replicas is guaranteed. 

The MPS consensus protocol is leader (primary) 
based. While the protocol ensures safety despite the ex- 
istence of multiple primaries, a single primary is needed 
for the protocol to make progress. A replica can act as 
a primary using a certain ballot number. A primary ac- 
cepts requests from clients and proposes those requests 
as decrees, where decree numbers indicate the positions 
of the requests in the sequence of commands that is going 
to be executed by the replicated state machine. A decree 
is considered committed when the primary gets acknowl- 
edgment from a quorum (often a majority) of replicas in- 
dicating that they have accepted and persistently stored 
the decree. 

If a replica receives a message that indicates that a 
decree unknown to the replica is committed, then the 
replica enters a learning phase, in which it learns the
missing decrees from other replicas. 

When an existing primary is considered to have failed, 
a new primary can be elected. The new primary will use 
a higher ballot number and carry out a prepare phase to 
learn the decrees that could have been committed and 
ensure no conflicting decrees are proposed. For each 
replica, a proposal with a higher ballot number over- 
writes any previous proposal with lower ballot numbers. 

Our test setup consists of 3 replicas, proposing a small 
number of decrees. 


Results and Discussions. We found 13 bugs in MPS:
11 are implementation bugs that crash replicas, and the
other two are protocol-level bugs.

The first protocol-level bug reveals a scenario that 
leads to state transitions that are not expected by the de- 
velopers (as demonstrated by the assertion that rules out 
the transition). MPS has a simple set of states and state 
transitions. A replica is normally in a stable state. When 
it gets an indication that its state is falling behind (i.e., missing
decrees), it enters a learning state. In the learning
state, it fetches the decrees from a quorum of replicas.
Once it brings its state up to date with what it receives
from a quorum of replicas, it checks whether it should
become a primary: if the primary lease has expired, then
the replica will compete to be a primary by entering a
preparing state; otherwise, it will return to a stable state.
There is an assertion in the code (and also in the design
document for MPS) that the state transition from stable
to preparing is impossible.


Figure 8: The Timing Diagram of Message Exchange in
MPS Bug 2.

Figure 7 shows the MODIST-generated scenario that 
triggers the assertion failure. The following is a list of 
steps that lead to the violation. Consider the case where 
the system consists of three replicas A, B, and C, where 
any two of them form a quorum. Replica A enters the 
learning state because it realizes that it does not have 
the information related to some decree numbers. This 
could be due to the receipt of a message that indicates 
that the last committed decree number is at least k, while 
A knows only up to some decree number less than k. A 
then sends a status query to B and C. A receives the re- 
sponse from B and learns all the missing decrees. Since 
A and B form a quorum, A enters the stable state. C
was the primary. C’s response to A’s status query was delayed,
and the primary lease expired on A. At
some later point, C’s response arrives. The implementation
will handle that message as if A were in the learning
state. After A is done, it notices that the primary lease 
has expired and transitions into the preparing state, caus- 
ing the unexpected state transition. As a result, A crashes 
and reboots. 
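
The sequence above can be replayed on a three-state model. In the sketch below, the handler with guard_state set to false mimics the behavior described in the text, where a late status response is processed as if the replica were still learning. The guard_state check is a hypothetical remedy added only to contrast the two behaviors; the text does not say how the MPS developers fixed the bug.

```cpp
#include <cassert>

enum class ReplicaState { Stable, Learning, Preparing };

struct Replica {
    ReplicaState state = ReplicaState::Stable;
    bool primary_lease_expired = false;
};

// Processes a status response. Returns false if the stable -> preparing
// transition, which the design rules out, took place.
bool on_status_response(Replica& r, bool guard_state) {
    if (guard_state && r.state != ReplicaState::Learning) return true;  // ignore stale response
    const ReplicaState before = r.state;
    r.state = r.primary_lease_expired ? ReplicaState::Preparing : ReplicaState::Stable;
    return !(before == ReplicaState::Stable && r.state == ReplicaState::Preparing);
}

// Replays the scenario for replica A; returns true if the assertion would fire.
bool scenario_violates(bool guard_state) {
    Replica a;
    a.state = ReplicaState::Learning;               // A notices missing decrees
    bool ok = on_status_response(a, guard_state);   // B's response: A becomes stable
    assert(ok && a.state == ReplicaState::Stable);
    a.primary_lease_expired = true;                 // lease expires while C's response is delayed
    ok = on_status_response(a, guard_state);        // C's late response arrives
    return !ok;
}
```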

The second protocol-level bug is a violation of a global 
liveness assertion. It is triggered during primary election 
under the following scenario: replica A has accepted a 
decree with ballot number 2 and decree number 1, while 
replica B only has ballot number 1, but accepted a decree 
of decree number 2. 

The following series of events leads to this problematic
scenario. B is a primary with ballot number 1. It proposes
a decree with decree number 1, and the decree is
accepted by all replicas including A and B. It then proposes
another decree with decree number 2, which is accepted
only on B. B fails before A gets the proposal. A
then becomes a primary with ballot number 2, learns the
decree with decree number 1, and re-proposes it with ballot
number 2.

Figure 8 shows the timing diagram continuing from 
this scenario. B comes back, receives the prepare request 
from A, and sends a rejection to A because B thinks A 
is not up-to-date given that B has a higher decree num- 
ber. After getting the rejection, A enters a learning state. 
In the learning state, even if B returns the decree with 
decree number 2, A will reject it because it has a lower 
ballot number. A will consider itself up-to-date and enter 
the preparing state again with a yet higher ballot number. 
This continues as A keeps increasing its ballot number
but is unable to have new decrees committed, triggering a
liveness violation. 

The problem in this scenario is due to the inconsis- 
tency of the views on what constitutes a newer state be- 
tween the preparing phase and the learning phase: one 
view uses a higher ballot number, while the other uses 
a higher decree number. The inconsistency is exposed 
when one has a higher decree number, but a lower ballot 
number than the other. 
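
The inconsistency can be stated as two comparison functions that disagree on the same pair of replicas. The sketch below is an abstraction of the description above, not MPS code; ReplicaView and the function names are hypothetical.

```cpp
#include <cassert>

struct ReplicaView {
    int ballot;       // highest ballot number seen
    int last_decree;  // highest decree number accepted
};

// Prepare-phase view: the replica with the higher decree number is newer.
bool newer_by_decree(const ReplicaView& x, const ReplicaView& y) {
    return x.last_decree > y.last_decree;
}

// Learning-phase view: the replica with the higher ballot number is newer.
bool newer_by_ballot(const ReplicaView& x, const ReplicaView& y) {
    return x.ballot > y.ballot;
}

// The livelock needs a pair of replicas on which the two views disagree.
bool views_conflict(const ReplicaView& x, const ReplicaView& y) {
    return (newer_by_decree(x, y) && newer_by_ballot(y, x)) ||
           (newer_by_decree(y, x) && newer_by_ballot(x, y));
}

// Scenario from the text: A has ballot 2 and decree 1; B has ballot 1 and decree 2.
void check_scenario() {
    const ReplicaView a{2, 1};
    const ReplicaView b{1, 2};
    assert(views_conflict(a, b));
}
```

Any single ordering used in both phases would remove the conflict in this model; which ordering is correct for the real protocol is outside what the sketch can show.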


4.4 PACIFICA: a Primary-Backup Replication Pro- 
tocol 


PACIFICA [24] is a large-scale storage system for semi- 
structured data. It implements a Primary-Backup pro- 
tocol for data replication. We used MODIST to check 
an implementation of PACIFICA’s replication protocol. 
This implementation consists of 5K lines of C++ code 
for the communication protocol and 7K for utilities. 

PACIFICA uses a variety of familiar components in- 
cluding two-phase commit for consistent replica updates, 
perfect failure detection, replica group reconfiguration to 
handle node failures, and replica reconciliation for nodes 
rejoining a replica group. 

Our test setup for PACIFICA has 4 processes: 1 mas- 
ter that maintains global metadata, 2 replica nodes that 
implement the replication protocol, and 1 client that up- 
dates the system and drives the checking process. Fig- 
ure 2 shows the configuration file. 


Results and Discussions. We found 15 bugs in PACI- 
FICA: 9 are implementation bugs that cause crashes and 
6 are protocol-level bugs. We managed to find more 
protocol-level bugs in PACIFICA than in other systems 
for two reasons: (1) since we built the system, we could 
quickly fix the bugs MODIST found and then re-run MODIST
to go after other bugs; and (2) we could check more 
global assertions for PACIFICA. 

The most interesting bug we found in PACIFICA prevents
PACIFICA from making progress. It is triggered by
a node crash followed by a replication group reconfiguration.
A primary replica keeps a list of prepared updates
(i.e., updates that have been prepared on all replicas, but
not yet committed); a secondary replica does not have
this data structure. When a primary crashes, a secondary
will try to take over and become the new primary. If the
crash happens in the middle of a commit operation that
leaves some commands prepared but not yet committed,
the new primary will try to re-commit all prepared updates
by sending the “prepare” messages to the remaining
secondary replicas. Unfortunately, PACIFICA did not
put these newly prepared updates into the prepared update
list. This prevents all the following updates from
getting committed because of a hole in the prepared update
list.


Figure 9: Partial order state coverage of different exploration
strategies.
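
The effect of the hole can be shown with a small model. The sketch assumes that updates carry consecutive sequence numbers and are committed strictly in order from the prepared update list; this ordering rule and all names are assumptions for illustration and are not taken from the PACIFICA implementation.

```cpp
#include <cassert>
#include <cstdint>
#include <set>

struct Primary {
    std::set<std::uint64_t> prepared;  // sequence numbers on the prepared update list
    std::uint64_t next_to_commit = 1;

    // Commits as many consecutive prepared updates as possible; returns the count.
    int commit_ready() {
        int committed = 0;
        while (prepared.erase(next_to_commit) == 1) {
            ++next_to_commit;
            ++committed;
        }
        return committed;
    }
};

// The new primary re-prepares updates 1 and 2 after the crash. If it fails to
// record them (record_reprepared == false), update 3 can never commit.
int committed_after_takeover(bool record_reprepared) {
    Primary p;
    if (record_reprepared) {
        p.prepared.insert(1);
        p.prepared.insert(2);
    }
    p.prepared.insert(3);  // a new update arriving after the takeover
    const int n = p.commit_ready();
    assert(!record_reprepared || p.prepared.empty());
    return n;
}
```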


4.5 State Coverage 


To evaluate the state-space exploration strategies described
in §3.6, we measured state coverage: the number
of unique states a strategy could explore after running a 
fixed number of execution paths. We examined the cov- 
erage of two types of states: 

1. Partial order traces [12]. Since two paths with the 
same partial order are equivalent, the number of dif- 
ferent partial order traces provides an upper bound 
on the number of unique behaviors a strategy can ex- 
plore. 

2. Protocol states. These states capture the more impor- 
tant protocol behaviors of a distributed system. 

We did two experiments, both on MPS: one with a 
small partial order state space and the other with a nearly 
unbounded state space. These two state spaces give an 
idea of how sensitive the strategies are to state-space 
sizes. No crash was injected during the evaluation. 

In the first experiment, we made the state space small 
using a configuration of two nodes, each receiving up 
to two messages. Figure 9 shows the number of unique 
partial order traces with respect to the number of paths 
explored. (Note that both axes are in log scale.) DPOR
shows a clear advantage: it exhausted all 115,425 traces
after 134,627 paths (the small redundancy was due to an
approximation in our DPOR implementation). The Random
strategy explored 6,614 unique traces, or 5.7% of the
entire state space, after 200,000 paths. DFS is the worst:
all the 200,000 paths were partial order equivalent and
corresponded to only one partial order trace.


Figure 10: Protocol state coverage of different exploration
strategies.

In the second experiment, we used a nearly unbounded 
partial order state space with three MPS nodes send- 
ing and receiving an unbounded number of messages. 
We bounded the maximum decree (two decrees) and the 
maximum path length (40,000 actions) to make the exe- 
cution paths finite. Since the state space was large, it was 
unlikely that Random ever explored a partial order trace 
twice. As a result, DPOR behaved the same as Random. 
(This result is not shown.) 

While partial order state coverage provides an up- 
per bound on the unique behaviors a strategy explores, 
different partial order traces may still be redundant 
and map to the same protocol state. Thus, we fur- 
ther measured the protocol state coverage of different 
exploration strategies. We defined the protocol state
of MPS as a tuple (state, ballot, decree), where the
state could be initializing, learning, stable
primary, or stable secondary.*

Figure 10 shows the protocol states covered by the
first 50,000 paths explored in each strategy, using the
MPS configuration from the second experiment. DFS
had the worst coverage: it found no new states after exploring
the first path. The reason is that, when the state space
is large, DFS tends to explore a large number of paths
that differ only at the final few steps; these paths are
often partial-order equivalent. DPOR performed almost
equally badly: it found fewer than 30 protocol states. This
result is not surprising for two reasons: (1) different partial
order traces might correspond to the same protocol
state and (2) DPOR is DFS-based, and thus suffers the same
problem as DFS when the state space is large.


* We also measured the coverage of global protocol states, which
consist of protocol states of each node in a consistent global snapshot.
The results were similar and not shown.

In Bounded DPOR, protocol-level redundancy is par- 
tially conquered by the bounds on backtracks. As 
shown in Figure 10, the protocol-level state coverage of 
Bounded DPOR was larger than that of DPOR by an or- 
der of magnitude, in the first 50,000 paths. 

Surprisingly, the Random strategy yielded better cov- 
erage than DFS, DPOR, and even Bounded DPOR. The 
reason is that Random is more balanced: it explores ac- 
tions anywhere along a path uniformly, therefore it has a 
better chance to jump to a new path early on and explores 
a different area of the state space. 

These results prompted us to develop a hybrid Random 
+ Bounded DPOR search strategy that works as follows. 
It starts with a random path and explores the state space 
with Bounded DPOR. We further bound the total num- 
ber of backtracks so that the Bounded DPOR exploration 
ends. Then, a new round of Bounded DPOR exploration 
starts with a new random path. Random + Bounded 
DPOR inherits both the balance of Random and the thoroughness
of DPOR to cover the corner cases. Both the
number of rounds of DPOR exploration and the bound on
the total number of backtracks are customizable, reflecting
a bias towards Random or towards DPOR. As shown
in Figure 10, the Random + Bounded DPOR strategy
with 100 rounds performed the best.


4.6 Performance 


In our performance measurements, we focused on three 
metrics: (1) MODIST’s path exploration speed; (2) the 
speedup due to the virtual clock fast-forward; and (3) 
the runtime overhead MODIST adds to the target system, 
including interposition, RPC, and backend scheduling. 

We set up our experiments as follows. We ran 
MODIST with two different search strategies: RAN- 
DOM and DPOR. For each search strategy, we let 
MODIST explore 1K execution paths and recorded the 
running times. We repeated this experiment 50 times and 
took the average. We used Berkeley DB and MPS as
our benchmarks, with the same configurations as those
used for error detection. We ran our experiments on a
64-bit Windows Server 2003 machine with dual Intel Xeon
5130 CPUs and 4GB of memory. We measured all time values
using QueryPerformanceCounter(), a high-resolution
performance counter.

It might seem that we should measure MODIST’s overhead
by comparing a system’s executions with MODIST
to those without. However, due to nondeterminism, we
cannot compare these two directly: the executions without
MODIST may run different program paths than those
with MODIST. Moreover, repeated executions of the
same testcase without MODIST may differ; we did observe
a large variance in MPS’s execution times and final
protocol states. Thus, we evaluated MODIST’s overhead
by running a system with MODIST and measuring the
time spent in MODIST’s components.


System       Strategy  Real (s)     Sleep (s)         Speedup     Overhead (absolute and relative)
Berkeley DB  RANDOM    1,717 ± 14   38,204 ± 193      25.7 ± 0.2  302 ± 1s (17.7 ± 0.1%)
Berkeley DB  DPOR      1,658 ± 24   36,4024-95,137    2226322     301 ± 17s (18.2 ± 0.9%)
MPS          RANDOM    1,661 ± 20   240,568 ± 1,405   216222      829 ± 11s (49.9 ± 0.2%)
MPS          DPOR      1,853 ± 116  295,435 ± 45,659  159 ± 19    1,048 ± 108s (56.5 ± 2.6%)


Table 3: MODIST’s performance. All numbers are of the form average ± standard deviation.

Table 3 shows the performance results. The Real col- 
umn shows the time it took for MODIST to explore 1K 
paths of Berkeley DB and MPS with RANDOM and 
DPOR strategies; the exploration speed is roughly two 
seconds per path and does not change much for the two 
different search strategies. The Sleep column shows the 
time MODIST saved using its virtual clock when the 
target systems were asleep; we would have spent this 
amount of extra time had we run the same executions 
without MODIST. As shown in the table, the real execution
time is much smaller than the sleep time, which translates
into significant speedups (Column Speedup, computed
as Sleep/Real). The Overhead column in this table
shows the time spent in MODIST’s interposition, RPC, 
and backend scheduling. For Berkeley DB, MODIST ac- 
counts for about 18% of the real execution time. For 
MPS, MODIST accounts for a higher percentage of exe- 
cution time (up to 56.5%) because the MPS testcase we 
used is almost the worst case for MODIST: it only exer- 
cises the underlying communication protocol and does 
no real message processing. Nonetheless, we believe 
such overhead is reasonable for an error detection tool. 
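
As a worked check of how the Speedup and Overhead columns relate to the others, the sketch below applies the two ratios to the MPS DPOR row of Table 3 (Real 1,853 s, Sleep 295,435 s, overhead 1,048 s). The function names are chosen for this example.

```cpp
#include <cassert>

// Speedup from the virtual clock, computed as Sleep / Real.
double speedup(double sleep_seconds, double real_seconds) {
    assert(real_seconds > 0.0);
    return sleep_seconds / real_seconds;
}

// Fraction of the real execution time spent inside the checker.
double overhead_fraction(double overhead_seconds, double real_seconds) {
    assert(real_seconds > 0.0);
    return overhead_seconds / real_seconds;
}
```

For that row, speedup(295435, 1853) is about 159 and overhead_fraction(1048, 1853) is about 0.566, in line with the reported 159 and 56.5%. The small difference in the percentage would be expected if the table reports the mean of per-run ratios and not the ratio of the means, which is an assumption here.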


4.7 Lessons 


This section discusses the lessons we learned. 

Real distributed protocols are buggy. We found
many protocol-level bugs, and we found them in every
system we targeted, suggesting that real distributed protocols
are buggy. Amusingly, these protocols are based
on theoretically sound protocols; the bugs were introduced
when developers filled in the unspecified parts of the
protocols in practice.

Controlling all non-determinism is hard. System- 
atic checking requires control of non-determinism in the 
target system. This task is very hard given the non- 
determinism in the OS and network, the wide API inter- 
face, the many possible failures and their combinations, 
and MODIST’s goal of reducing intrusiveness to the tar- 
get system. We have had bitter experiences debugging 
non-deterministic errors in Berkeley DB, which uses process
id, memory address, and time to generate random
numbers, and in MPS, which randomly interferes with
the default Windows firewall. Of all these tasks, making the
Windows socket APIs deterministic was the most difficult;
the interface shown in §3 went through several iterations.
Our own experiences show that controlling all
non-determinism is much harder than merely capturing 
it as in replay-debugging tools. 

Avoid false positives at all cost. False positives may 
take several days to diagnose. Thus, we want to avoid 
them, even at the risk of missing errors. 


Leverage domain knowledge. In a sense, this entire 
paper boils down to leveraging the domain knowledge 
of distributed systems to better model-check them. The 
core idea of model checking is simple: explore all pos- 
sible executions; a much more difficult task is to imple- 
ment this idea effectively in an application domain. 


When in doubt, reboot. When we checked MPS, we 
were surprised by how robust it was. MPS uses a de- 
fensive programming technique that works particularly 
well in the context of distributed replication protocols. 
MPS extensively uses local assertions, reboots when any 
assertion fails, and relies on the replication protocol to 
recover from these eager reboots. This recovery mecha- 
nism makes MPS robust against a wide range of failures. 
Of course, rebooting is not without penalty: if a primary 
reboots, there could be noticeable performance degrada- 
tion, and the system also becomes less fault tolerant. 


5 Related Work 


5.1 Model Checking 


Model checkers have previously been used to find er- 
rors in both the design and the implementation of soft- 
ware [1, 6, 12, 17-19, 27, 28, 34, 38, 39]. Traditional 
model checkers require users to write an abstract model 
of the target system, which often incurs large up-front 
cost when checking large systems. In contrast, MODIST 
is an implementation-level model checker that checks 
code directly and thus avoids this cost. Below we compare
MODIST to implementation-level model checkers. 


Model checkers for distributed systems. MODIST is
most related to model checkers that check real distributed 
system implementations. CMC [27] is a stateful model 
checker that checks C code directly. It has been used 
to check network protocol implementations [27] and file 
systems [38]. However, to check a system, CMC re- 
quires invasive modifications to run the system inside 
CMC’s address space [39]. MaceMC [19] uses bounded 
depth first search combined with random walk to find 
safety and liveness bugs in a number of network protocol
implementations written in a domain-specific language.
Compared to these two checkers, MODIST directly
checks live, unmodified distributed systems running
in their native execution environments, and thus avoids
the invasive modifications required by CMC and the language
restrictions [20] enforced by MaceMC.


CrystalBall [37] detects and avoids errors in deployed 
distributed systems using an efficient global state col- 
lection and exploration technique. While CrystalBall is 
based on MaceMC and thus checks only systems writ- 
ten in the Mace language [20], its core technique may 
be portable to MODIST’s model checking framework to 
improve the reliability of general distributed systems. 


Other software model checkers. We compare 
MODIST to other closely related implementation-level 
model checkers. Our transparent checking approach 
is motivated by our previous work EXPLODE [39]. 
However, EXPLODE focuses on storage systems and 
does not check distributed systems. 


To the best of our knowledge, VeriSoft [12] is the first
implementation-level model checker. It systematically 
explores the interleavings of concurrent C programs, and 
uses partial order reduction to soundly reduce the number 
of states it explores. It has been used to check industrial- 
strength programs [5]. 


Chess [28] is a stateless model checker for explor- 
ing the interleavings of multi-threaded programs. To 
avoid perturbing the target system, it also interposes on 
WinAPIs. In addition, Chess uses a context-bounding 
heuristic and a starvation-free scheduler to make its 
checking more efficient. It has been applied to several 
industry-scale systems and found many bugs. 


ISP [35] is an implementation-level model checker for 
MPI programs. It controls an MPI program by intercepting
calls to MPI methods and reduces the state space it
explores using new partial order reduction algorithms. 


All three systems focus on checking interleavings
of concurrent programs, and thus do not address issues in
checking real distributed systems, such as providing a
transparent, distributed checking architecture and enabling
consistent and deterministic failure simulation.


5.2  Replay-based debugging 


A number of systems [11, 21, 32], including our pre- 
vious work [15, 25], use deterministic replay to debug 
distributed systems. These approaches attack a different
problem: when a bug occurs, how to capture its manifes- 
tation so that developers can reproduce the bug. Com- 
bined with fault injection, these tools can be used to de- 
tect bugs. Like these systems, MODIST also provides re- 
producibility of errors. Unlike these systems, MODIST 
aims to proactively drive the target system into corner- 
cases for errors in the testing phase before the system is 
deployed. MODIST uses the instrumentation library in 
our previous work [25] to interpose on WinAPIs. 


5.3 Other error detection techniques


We view testing as complementary to our approach. Test- 
ing is usually less comprehensive than our approach, but 
works “out of the box.” Thus, there is no reason not to 
use both testing and MODIST together. 

There has been much recent work on static bug finding 
(e.g., [1, 2, 7, 8, 10, 33]). Roughly speaking, because dy- 
namic checking runs code, it is limited to just executed 
paths, but can more effectively check deeper properties 
implied by the code (e.g., two replicas are consistent). 
The protocol-level errors we found would be difficult to 
find statically. We view static analysis as complementary:
it is easy enough to apply that there is no reason
not to use it together with MODIST.

Recently, symbolic execution [3, 4, 13, 31] has been 
used to detect errors in real systems. This technique is 
good at detecting bugs caused by tricky input values, 
whereas our approach is good at detecting bugs caused 
by the non-deterministic events in the environment. 


6 Conclusions 


MODIST represents an important step in achieving the 
ideal of model checking unmodified distributed systems
in a transparent and effective way. Its effectiveness has 
been demonstrated by the subtle bugs it uncovered in 
well-tested production and deployed systems. 

Our experience shows that achieving this ideal requires a
combination of art, science, and engineering. It is an art
because various heuristics must be developed for finding
delicate bugs effectively, taking into account the peculiarity
of complex distributed systems; it is a science because a
systematic, modular approach with a carefully designed
architecture is a key enabler; and it is engineering because
it involves heavy effort to interpose between the application
and the OS, to model and control low-level system behavior,
and to handle system-level non-determinism.
Abstract


We propose a new approach for developing and de- 
ploying distributed systems, in which nodes predict dis- 
tributed consequences of their actions, and use this in- 
formation to detect and avoid errors. Each node con- 
tinuously runs a state exploration algorithm on a re- 
cent consistent snapshot of its neighborhood and pre- 
dicts possible future violations of specified safety prop- 
erties. We describe a new state exploration algorithm, 
consequence prediction, which explores causally related 
chains of events that lead to property violation. 

This paper describes the design and implementation 
of this approach, termed CrystalBall. We evaluate Crys- 
talBall on RandTree, BulletPrime, Paxos, and Chord 
distributed system implementations. We identified new 
bugs in mature Mace implementations of three systems. 
Furthermore, we show that if the bug is not corrected 
during system development, CrystalBall is effective in 
steering the execution away from inconsistent states at 
runtime. 


1 Introduction 


Complex distributed protocols and algorithms are used in 
enterprise storage systems, distributed databases, large- 
scale planetary systems, and sensor networks. Errors 
in these protocols translate to denial of service to some 
clients, potential loss of data, and monetary losses. The 
Internet itself is a large-scale distributed system, and 
there are recent proposals [19] to improve its routing re- 
liability by further treating routing as a distributed con- 
sensus problem [26]. Design and implementation prob- 
lems in these protocols have the potential to deny vital 
network connectivity to a large fraction of users. 

Unfortunately, it is notoriously difficult to develop reliable
high-performance distributed systems that run over
asynchronous networks. Even if a distributed system is
based on a well-understood distributed algorithm, its implementation
can contain errors arising from complexities
of realistic distributed environments or simply coding
errors [27]. Many of these errors can only manifest
after the system has been running for a long time, has developed
a complex topology, and has experienced a particular
sequence of low-probability events such as node
resets. Consequently, it is difficult to detect such errors
using testing and model checking, and many such errors
remain unfixed after the system is deployed.

Figure 1: Execution path coverage by a) classic model checking,
b) replay-based or live predicate checking, c) CrystalBall
in deep online debugging mode, and d) CrystalBall in execution
steering mode. A triangle represents the state space searched by
the model checker; a full line denotes an execution path of the
system; a dashed line denotes an avoided execution path that
would lead to an inconsistency.

We propose to leverage increases in computing power 
and bandwidth to make it easier to find errors in dis- 
tributed systems, and to increase the resilience of the 
deployed systems with respect to any remaining errors. 
In our approach, distributed system nodes predict con- 
sequences of their actions while the system is running. 
Each node runs a state exploration algorithm on a consis- 
tent snapshot of its neighborhood and predicts which ac- 
tions can lead to violations of user-specified consistency 
properties. As Figure 1 illustrates, the ability to detect 
future inconsistencies allows us to address the problem 
of reliability in distributed systems on two fronts: debugging
and resilience.


• Our technique enables deep online debugging because
  it explores more states than live runs alone
  or model checking from the initial state. For each
  state that a running system experiences, our technique
  checks many additional states that the system
  did not go through, but that it could reach in similar
  executions. This approach combines benefits of
  distributed debugging and model checking.


• Our technique aids resilience because a node can
  modify its behavior to avoid a predicted inconsistency.
  We call this approach execution steering.
  Execution steering enables nodes to resolve non-determinism
  in ways that aim to minimize future
  inconsistencies.


To make this approach feasible, we need a fast 
state exploration algorithm. We describe a new algorithm,
termed consequence prediction, which is efficient
enough to detect future violations of safety properties in
a running system. Using this approach we identified bugs 
in Mace implementations of a random overlay tree, and 
the Chord distributed hash table. These implementations 
were previously tested as well as model-checked by ex- 
haustive state exploration starting from the initial system 
state. Our approach therefore enables the developer to 
uncover and correct bugs that were not detected using 
previous techniques. Moreover, we show that, if a bug is 
not detected during system development, our approach is 
effective in steering the execution away from erroneous 
states, without significantly degrading the performance 
of the distributed service. 


1.1 Contributions 


We summarize the contributions of this paper as follows: 


• We introduce the concept of continuously executing
  a state space exploration algorithm in parallel with a
  deployed distributed system, and introduce an algorithm
  that produces useful results even under tight
  time constraints arising from runtime deployment;


• We describe a mechanism for feeding a consistent
  snapshot of the neighborhood of a node in a
  large-scale distributed system into a running model
  checker; the mechanism enables reliable consequence
  prediction within limited time and bandwidth
  constraints;


• We present execution steering, a technique that enables
  the system to steer execution away from possible
  inconsistencies;


• We describe CrystalBall, the implementation of
  our approach on top of the Mace framework [21].
  We evaluate CrystalBall on RandTree, Bullet’,
  Paxos, and Chord distributed system implementations.
  CrystalBall detected several previously unknown
  bugs that can cause system nodes to reach
  inconsistent states. Moreover, if the developer is not
  in a position to fix these bugs, CrystalBall’s execution
  steering predicts them in a deployed system and
  steers execution away from them, all with an acceptable
  impact on the overall system performance.


1.2 Example 


We next describe an example of an inconsistency ex- 
hibited by a distributed system, then show how Crystal- 
Ball predicts and avoids it. The inconsistency appears 
in the Mace [21] implementation of the RandTree over- 
lay. RandTree implements a random, degree-constrained 
overlay tree designed to be resilient to node failures and 
network partitions. Trees built by an earlier version of 
this protocol serve as a control tree for a number of large- 
scale distributed services such as Bullet [23] and Ran- 
Sub [24]. In general, trees are used in a variety of mul- 
ticast scenarios [3, 7] and data collection/monitoring en- 
vironments [17]. Inconsistencies in these environments 
translate to denial of service to users, data loss, incon- 
sistent measurements, and suboptimal control decisions. 
The RandTree implementation was previously manually 
debugged both in local- and wide-area settings over a pe- 
riod of three years, as well as debugged using an existing 
model checking approach [22], but, to our knowledge, 
this inconsistency has not been discovered before (see 
Section 4 for some of the additional bugs that Crystal- 
Ball discovered). 

RandTree Topology. Nodes in a RandTree overlay form 
a directed tree of bounded degree. Each node maintains 
a list of its children and the address of the root. The node 
with the numerically smallest IP address acts as the root 
of the tree. Each non-root node contains the address of 
its parent. Children of the root maintain a sibling list. 
Note that, for a given node, its parent, children, and sib- 
lings are all distinct nodes. The seemingly simple task 
of maintaining a consistent tree topology is complicated 
by the requirement for groups of nodes to agree on their 
roles (root, parent, child, sibling) across asynchronous 
networks, in the face of node failures, and machine slow- 
downs. 

Figure 2: An inconsistency in a run of RandTree

Joining the Overlay. A node nj joins the overlay by
issuing a Join request to one of the designated nodes.
If the node receiving the join request is not the root, it
forwards the request to the root. If the root already has
the maximal number of children, it asks one of its children
to incorporate the node into the overlay. Once the
request reaches a node np whose number of children is
less than the maximum allowed, node np inserts nj as one of
its children, and notifies nj about a successful join using
a JoinReply message (if np is the root, it also notifies its
other children about their new sibling nj using an UpdateSibling
message).
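
The join procedure described above can be sketched in Python as follows. This is a minimal single-process illustration under stated assumptions, not the Mace implementation: the `Node` class, the `MAX_CHILDREN` bound, and the use of direct method calls in place of Join, JoinReply, and UpdateSibling messages are all hypothetical. The sketch also ignores the rule that the node with the smallest IP address becomes the root.

```python
MAX_CHILDREN = 2  # assumed degree bound for this illustration


class Node:
    """Hypothetical in-memory stand-in for a RandTree participant."""

    def __init__(self, name):
        self.name = name
        self.parent = None      # None for the root
        self.children = []
        self.siblings = set()   # maintained by children of the root

    def handle_join(self, joiner, root):
        # A non-root node forwards the request to the root.
        if self is not root:
            return root.handle_join(joiner, root)
        return self._place(joiner, root)

    def _place(self, joiner, root):
        # A full node delegates the request to one of its children.
        if len(self.children) >= MAX_CHILDREN:
            return self.children[0]._place(joiner, root)
        # Stands in for JoinReply: the joiner learns its parent.
        joiner.parent = self
        if self is root:
            # Stands in for UpdateSibling: children of the root
            # learn about each other.
            for child in self.children:
                child.siblings.add(joiner.name)
                joiner.siblings.add(child.name)
        self.children.append(joiner)
        return self


root = Node("n1")
a, b, c = Node("n2"), Node("n3"), Node("n4")
root.handle_join(a, root)
a.handle_join(b, root)      # forwarded to the root
root.handle_join(c, root)   # the root is full, so n2 adopts n4
```

In a real deployment each of these calls is an asynchronous message, which is what makes agreement on roles difficult.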

Example System State. The first row of Figure 2 shows
a state of the system that we encountered by running
RandTree in the ModelNet cluster [43] starting from the
initial state. We examine the local states of nodes n1,
n9, and n13. For each node n we display its neighborhood
view as a small graph whose central node is n itself,
marked with a circle. If a node is root and in a “joined”
state, we mark it with a triangle in its own view.

The state in the first row of Figure 2 is formed by n13
joining as the only child of n9 and then n1 joining and
assuming the role of the new root with n9 as its only child
(n13 remains as the only child of n9). Although the final
state shown in the first row of Figure 2 is simple, it takes
13 steps of the distributed system (such as atomic handler
executions, including application events) to reach
this state from the initial state.

Scenario Exhibiting Inconsistency. Figure 2 describes
a sequence of actions that leads to a state that violates the
consistency of the tree. We use arrows to represent the
sending and the receiving of some of the relevant messages.
A dashed line separates distinct distributed system
states (for simplicity we skip certain intermediate states
and omit some messages).

The sequence begins with a silent reset of node n13
(such a reset can be caused by, for example, a power failure).
After the reset, n13 attempts to join the overlay
again. The root n1 accepts the join request and adds n13
as its child. Up to this point node n9 has received no information
on actions that followed the reset of n13, so n9
maintains n13 as its own child. When n1 accepts n13 as
a child, it sends an UpdateSibling message to n9. At this
point, n9 simply inserts n13 into the set of its siblings. As
a result, n13 appears both in the list of children and in
the list of siblings of n9, which is inconsistent with the
notion of a tree.
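
The final step of this scenario can be replayed on a toy model. The following Python sketch is illustrative only: the `View` class and both handler functions are hypothetical names, and the safety property is reduced to the single condition violated here (no node is both a child and a sibling). The second handler shows one possible repair, in which the receiver drops the announced sibling from its children list.

```python
class View:
    """Hypothetical local view of one node: only the two lists that matter."""

    def __init__(self):
        self.children = set()
        self.siblings = set()


def children_siblings_disjoint(view):
    # Safety property: no node is both a child and a sibling.
    return not (view.children & view.siblings)


def update_sibling_buggy(view, new_sibling):
    # Mirrors the behavior described above: insert without further checks.
    view.siblings.add(new_sibling)


def update_sibling_fixed(view, new_sibling):
    # One possible repair: a node announced as a sibling
    # cannot still be a child.
    view.children.discard(new_sibling)
    view.siblings.add(new_sibling)


# n9 still believes n13 is its child, because it never heard
# about the silent reset.
n9 = View()
n9.children.add("n13")
update_sibling_buggy(n9, "n13")
violated = not children_siblings_disjoint(n9)

n9_fixed = View()
n9_fixed.children.add("n13")
update_sibling_fixed(n9_fixed, "n13")
```

With the buggy handler `violated` is true; with the repaired handler the property holds and n13 appears only in the sibling set.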

Challenges in Finding Inconsistencies. We would
clearly like to avoid inconsistencies such as the one appearing
in Figure 2. Once we have realized the presence
of such an inconsistency, we can, for example, modify
the handler for the UpdateSibling message to remove
the new sibling from the children list. Previously,
researchers had successfully used explicit-state model 
checking to identify inconsistencies in distributed sys- 
tems [22] and reported a number of safety and liveness 
bugs in Mace implementations. However, due to an ex- 
ponential explosion of possible states, current techniques 
capable of model checking distributed system implemen- 
tations take a prohibitively long time to identify inconsis- 
tencies, even for seemingly short sequences such as the 
ones needed to generate states in Figure 2. For exam- 
ple, when we applied the Mace Model Checker’s [22] 
exhaustive search to the safety properties of RandTree 
starting from the initial state, it failed to identify the in- 
consistency in Figure 2 even after running for 17 hours 
(on a 3.4-GHz Pentium-4 Xeon that we used for all our 
experiments in Section 4). The reason for this long run- 
ning time is the large number of states reachable from the 
initial state up to the depth at which the bug occurs, all 
of which are examined by an exhaustive search. 


1.3 CrystalBall Overview


Instead of running the model checker from the initial 
state, we propose to execute a model checker concur- 
rently with the running distributed system, and contin- 
uously feed current system states into the model checker. 
When, in our example, the system reaches the state at the 
beginning of Figure 2, the model checker will predict the 
state at the end of Figure 2 as a possible future inconsis- 
tency. In summary, instead of trying to predict all possi- 
ble inconsistencies starting from the initial state (which 
for complex protocols means never exploring states be- 
yond the initialization phase), our model checker predicts 
inconsistencies that can occur in a system that has been 
running for a significant amount of time in a realistic en- 
vironment. 

As Figure 1 suggests, compared to the standard model
checking approach, this approach identifies inconsisten- 
cies that can occur within much longer system execu- 
tions. Compared to simply running the system for a long 
time, our approach has two advantages. 


1. Our approach systematically covers a large number 
of executions that contain low-probability events,
such as node resets that ultimately triggered the in- 
consistency in Figure 2. It can take a very long time 
for a running system to encounter such a scenario, 
which makes testing for possible bugs difficult. Our 
technique therefore improves system debugging by 
providing a new technique that combines some of 
the advantages of testing and static analysis. 


2. Our approach identifies inconsistencies before they 
actually occur. This is possible because the model 
checker can simulate packet transmission in time 
shorter than propagation latency, and because it can 
simulate timer events in time shorter than the
actual time delays. This aspect of our approach 
opens an entirely new possibility: adapt the behav- 
ior of the running system on the fly and avoid an in- 
consistency. We call this technique execution steer- 
ing. Because it does not rely on a history of past in- 
consistencies, execution steering is applicable even 
to inconsistencies that were previously never ob- 
served in past executions. 
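
To illustrate why a checker can run ahead of the live system, the following Python sketch shows a discrete-event clock that fires timers and delivers messages without waiting for real time to pass. The text does not describe how CrystalBall simulates time, so this is only one common way to do it; the `VirtualClock` class and its methods are hypothetical.

```python
import heapq


class VirtualClock:
    """Discrete-event clock: time jumps straight to the next pending event."""

    def __init__(self):
        self.now = 0.0
        self._queue = []   # entries are (fire_time, sequence, callback)
        self._seq = 0      # tie-breaker so callbacks are never compared

    def schedule(self, delay, callback):
        heapq.heappush(self._queue, (self.now + delay, self._seq, callback))
        self._seq += 1

    def run(self):
        while self._queue:
            fire_time, _, callback = heapq.heappop(self._queue)
            self.now = fire_time   # no real waiting takes place
            callback()


clock = VirtualClock()
trace = []
# A 30-second timer and a message with 80 ms of simulated latency.
clock.schedule(30.0, lambda: trace.append(("timer", clock.now)))
clock.schedule(0.08, lambda: trace.append(("deliver", clock.now)))
clock.run()
```

Both events are processed in simulated-time order, and the 30-second timer costs only the time needed to pop it from the queue.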
Figure 3: An example execution sequence that avoids
the inconsistency from Figure 2 thanks to execution
steering.


Example of Execution Steering. In our example, a
model checking algorithm running in n1 detects the violation
at the end of Figure 2. Given this knowledge,
execution steering causes node n1 not to respond to the
join request of n13 and to break the TCP connection with
it. Node n13 eventually succeeds in joining the random tree
(perhaps after some other nodes have joined first). The
stale information about n13 in n9 is removed once n9
discovers that the stale communication channel with n13
is closed, which occurs the first time n9 attempts to
communicate with n13. Figure 3 presents one scenario illustrating
this alternate execution sequence. Effectively,
execution steering has exploited the non-determinism
and robustness of the system to choose an alternative execution
path that does not contain the inconsistency.

Figure 4: High-level overview of CrystalBall


2 CrystalBall Design 


We next sketch the design of CrystalBall (see [44] for 
details). Figure 4 shows the high-level overview of a 
CrystalBall-enabled node. We concentrate on distributed 
systems implemented as state machines, as this is a 
widely-used approach [21, 25, 26, 37, 39]. 

The state machine interfaces with the outside world 
via the runtime module. The runtime receives the mes- 
sages coming from the network, demultiplexes them, and 
invokes the appropriate state machine handlers. The 
runtime also accepts application level messages from 
the state machines and manages the appropriate network 
connections to deliver them to the target machines. This 
module also maintains the timers on behalf of all services 
that are running. 

The CrystalBall controller contains a checkpoint man- 
ager that periodically collects consistent snapshots of a 
node’s neighborhood. The controller feeds them to the 
model checker, along with a checkpoint of the local state. 
The model checker runs the consequence prediction al- 
gorithm which checks user- or developer-defined proper- 
ties and reports any violation in the form of a sequence 
of events that leads to an erroneous state. 

CrystalBall can operate in two modes. In the deep on- 
line debugging mode the controller only outputs the in- 
formation about the property violation. In the execution 
steering mode the controller examines the report from
the model checker, prepares an event filter that can avoid 
the erroneous condition, checks the filter’s impact, and 
installs it into the runtime if it is deemed to be safe. 
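
As a rough illustration of an event filter installed into the runtime, consider the Python sketch below. The `EventFilter` and `Runtime` classes are hypothetical and much simpler than the real components: a filter here matches one message type from one sender, and the step that checks the filter's impact before installation is not modeled.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class EventFilter:
    """Hypothetical filter: block one message type from one sender."""
    sender: str
    msg_type: str

    def matches(self, sender, msg_type):
        return sender == self.sender and msg_type == self.msg_type


@dataclass
class Runtime:
    """Toy runtime that consults installed filters before dispatching."""
    filters: list = field(default_factory=list)
    delivered: list = field(default_factory=list)
    closed_connections: set = field(default_factory=set)

    def install(self, event_filter):
        self.filters.append(event_filter)

    def on_message(self, sender, msg_type):
        if any(f.matches(sender, msg_type) for f in self.filters):
            # Steering action from the example: do not respond, and
            # break the connection so the sender can retry later.
            self.closed_connections.add(sender)
            return False
        self.delivered.append((sender, msg_type))  # normal handler dispatch
        return True


runtime = Runtime()
runtime.install(EventFilter(sender="n13", msg_type="Join"))
blocked = runtime.on_message("n13", "Join")
passed = runtime.on_message("n7", "Join")
```

The filtered Join from n13 is dropped and its connection is closed, while the Join from the (made-up) node n7 reaches its handler as usual.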


2.1 Consistent Neighborhood Snapshots 


To check system properties, the model checker requires 
a snapshot of the system-wide state. Ideally, every node 
would have a consistent, up-to-date checkpoint of ev- 
ery other participant’s state. Doing so would give ev- 
ery node high confidence in the reports produced by the 
model checker. However, given that the nodes could be 
spread over a high-latency wide-area network, this goal 
is unattainable. In addition, the sheer amount of band- 
width required to disseminate checkpoints might be ex- 
cessive. 


Given these fundamental limitations, we use a solution
that aims for scalability: we apply model checking to a
subset of all states in a distributed system. We leverage
the fact that in scalable systems a node typically
communicates with a small subset of other participants
(“neighbors”) and perform model checking only on this
neighborhood. In some distributed hash table implementations,
a node keeps track of O(log n) other nodes; in
mesh-based content distribution systems, nodes communicate
with a constant number of peers, or with a number that
does not explicitly grow with the size of the system. In a
random overlay tree, a node is typically aware of the root,
its parent, its children, and its siblings. We therefore arrange
for a node to distribute its state checkpoints to its
neighbors, and we refer to this set of neighbors as the snapshot
neighborhood. The checkpoint manager maintains checkpoints
and snapshots. Other CrystalBall components can request
an on-demand snapshot to be gathered by invoking
an appropriate call on the checkpoint manager.


Discovering and Managing Snapshot Neighborhoods. 
To propagate checkpoints, the checkpoint manager needs 
to know the set of a node’s neighbors. This set is depen- 
dent upon a particular distributed service. We use two 
techniques to provide this list. In the first scheme, we 
ask the developer to implement a method that will re- 
turn the list of neighbors. The checkpoint manager then 
periodically queries the service and updates its snapshot 
neighborhood. 


Since changing the service code might not always be 
possible, our second technique uses a heuristic to deter- 
mine the snapshot neighborhood. Specifically, we peri- 
odically query the runtime to obtain the list of open con- 
nections (for TCP), and recent message recipients (for 
UDP). We then cluster connection endpoints according
to the communication times, and select a sufficiently
large cluster of recent connections.
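
The text does not specify the clustering procedure, so the following Python sketch shows only one plausible reading of the heuristic: endpoints are ordered by how recently they communicated, a cluster boundary is declared wherever the age gap between consecutive endpoints exceeds a threshold, and whole clusters are taken until enough peers have been selected. The function name and the `gap` and `min_size` parameters are assumptions.

```python
def snapshot_neighborhood(last_contact, now, gap=5.0, min_size=3):
    """Pick recent peers by gap-based clustering of last-contact times.

    last_contact maps endpoint -> time of last communication (seconds).
    gap and min_size are illustrative parameters, not values from the paper.
    """
    # Most recently contacted endpoints first.
    ordered = sorted(last_contact.items(), key=lambda kv: now - kv[1])
    selected = []
    previous_age = None
    for endpoint, stamp in ordered:
        age = now - stamp
        new_cluster = previous_age is not None and age - previous_age > gap
        # Stop at a cluster boundary once enough peers are selected.
        if new_cluster and len(selected) >= min_size:
            break
        selected.append(endpoint)
        previous_age = age
    return selected


contacts = {"a": 99.0, "b": 98.0, "c": 97.5, "d": 60.0, "e": 10.0}
neighbors = snapshot_neighborhood(contacts, now=100.0)
```

Here the three endpoints contacted within the last few seconds form the neighborhood, and the long-idle endpoints `d` and `e` are left out unless `min_size` forces another cluster to be included.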


Enforcing Snapshot Consistency. To avoid false pos- 
itives, we ensure that the neighborhood snapshot corre- 
sponds to a consistent view of a distributed system at 
some point of logical time. There has been a large body 
of work in this area, starting with the seminal paper by 
Chandy and Lamport [5]. We use one of the recent algo- 
rithms for obtaining consistent snapshots [29], in which 
the general idea is to collect a set of checkpoints that 
do not violate the happens-before relationship [25] es- 
tablished by messages sent by the distributed service. 

Instead of gathering a global snapshot, a node peri- 
odically sends a checkpoint request to the members of 
its snapshot neighborhood. Even though nodes receive 
checkpoints only from a subset of nodes, all distributed 
service and checkpointing messages are instrumented to 
carry the checkpoint number (logical clock) and each 
neighborhood snapshot is a fragment of a globally con- 
sistent snapshot. In particular, a node that receives a mes- 
sage with a logical timestamp greater than its own logical 
clock takes a forced checkpoint. The node then uses the 
forced checkpoint to contribute to the consistent snap- 
shot when asked for it. 
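
The forced-checkpoint rule can be sketched as follows. This Python fragment is a simplification under stated assumptions, not the algorithm of [29]: the `CheckpointingNode` class, the message layout, and the key-value service state are hypothetical, and only the rule described above is modeled (checkpoint before processing a message that carries a larger checkpoint number).

```python
import copy


class CheckpointingNode:
    """Hypothetical node that piggybacks a checkpoint number on messages."""

    def __init__(self, name):
        self.name = name
        self.clock = 0          # current checkpoint number (logical clock)
        self.state = {}         # service state, application-defined
        self.checkpoints = {}   # checkpoint number -> saved copy of state

    def take_checkpoint(self, number):
        # Save the state as it is before processing anything newer.
        self.checkpoints[number] = copy.deepcopy(self.state)
        self.clock = number

    def send(self, payload):
        # Every outgoing message carries the sender's checkpoint number.
        return {"from": self.name, "cn": self.clock, "payload": payload}

    def receive(self, message):
        if message["cn"] > self.clock:
            # Forced checkpoint: taken before the message is applied.
            self.take_checkpoint(message["cn"])
        key, value = message["payload"]
        self.state[key] = value

    def checkpoint_for(self, number):
        # Answer a later snapshot request with the stored checkpoint, if any.
        return self.checkpoints.get(number)


a, b = CheckpointingNode("a"), CheckpointingNode("b")
b.state["x"] = 1
a.take_checkpoint(1)               # a starts snapshot 1
b.receive(a.send(("x", 2)))        # b is forced to checkpoint first
```

Because b saves its state before applying the message, its checkpoint 1 does not reflect a message that a sent after a's checkpoint 1, so the pair of checkpoints respects the happens-before relationship.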

Node failures are commonplace in distributed systems,
and our algorithm has to deal with them. The checkpoint
manager proclaims a node to be dead if it experiences
a communication error (e.g., a broken TCP connection)
with it while collecting a snapshot. An additional
cause for an apparent node failure is a change of
a node’s snapshot neighborhood in the normal course of
operation (e.g., when a node changes parents in the random
tree). In this case, the node triggers a new snapshot
gather operation.

Checkpoint Content. Although the total footprint of
some services might be very large, this might not necessarily
be reflected in checkpoint size. For example,
the Bullet’ [23] file distribution application has a non-negligible
total footprint, but the actual file content transferred
in Bullet’ does not play any role in consistency detection.
In general, the checkpoint content is given by a
serialization routine. The developer can choose to omit
certain parts of the state from serialized content and reconstruct
them if needed at de-serialization time. As a result,
checkpoints are smaller, and the code compensates
for the lack of serialized state when a local state machine
is being created from a remote node’s checkpoint in the
model checker. We use a set of well-known techniques
for managing checkpoint storage (quotas) and controlling
the bandwidth used by checkpoints (bandwidth limits,
compression).
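
A serialization routine of this kind might look like the Python sketch below. The `FileTransferState` class and its fields are hypothetical and only loosely modeled on a file distribution service; the point is that bulky file content is left out of the checkpoint and replaced by placeholders when the state is rebuilt.

```python
import json


class FileTransferState:
    """Hypothetical service state with a bulky field irrelevant to checking."""

    def __init__(self, peers, have_blocks, block_data=None):
        self.peers = peers                  # protocol state worth checking
        self.have_blocks = have_blocks      # which block ids are present
        self.block_data = block_data or {}  # file content, omitted below

    def serialize(self):
        # Only protocol-relevant fields enter the checkpoint.
        return json.dumps({"peers": sorted(self.peers),
                           "have_blocks": sorted(self.have_blocks)})

    @classmethod
    def deserialize(cls, blob):
        fields = json.loads(blob)
        have = set(fields["have_blocks"])
        # Compensate for the omitted content with empty placeholders.
        placeholders = {block_id: b"" for block_id in have}
        return cls(set(fields["peers"]), have, placeholders)


original = FileTransferState({"n2", "n5"}, {0, 1},
                             {0: b"x" * 4096, 1: b"y" * 4096})
blob = original.serialize()
restored = FileTransferState.deserialize(blob)
```

The checkpoint stays a few dozen bytes long regardless of how much file data the node holds, while the restored state still carries everything a consistency property over peers and block ownership would inspect.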


2.2 Consequence Prediction Algorithm 


The key to enabling fast prediction of future inconsistencies
in CrystalBall is our consequence prediction algorithm,
presented in Figure 5. For readability, we present
the algorithm as a refinement of a generic state-space
search. The notation is based on a high-level semantics
of a distributed system, shown in Figure 6. (Our concrete
model checker implementation uses an iterative deepening
algorithm which combines memory efficiency of
depth-first search, while favoring the states in the near future,
as in breadth-first search.) The STOP_CRITERION
in Figure 5 in our case is given by time constraints and
external commands to restart the model checker upon the
arrival of a new snapshot.

 1 proc findConseq(currentState : G, property : (G → boolean))
 2   explored = emptySet(); errors = emptySet();
 3   localExplored = emptySet();
 4   frontier = emptyQueue();
 5   frontier.addLast(currentState);
 6   while (!STOP_CRITERION)
 7     state = frontier.popFirst();
 8     if (!property(state))
 9       errors.add(state); // predicted inconsistency found
10     explored.add(hash(state));
11     foreach ((n,s) ∈ state.L) // node n in local state s
12       // process all network handlers
13       foreach (((s,m),(s',c)) ∈ H_M where (n,m) ∈ state.I)
14         // node n handles message m according to st. machine
15         addNextState(state,n,s,s',{m},c);
16       // process local actions only for fresh local states
17       if (!localExplored.contains(hash(n,s)))
18         foreach (((s,a),(s',c)) ∈ H_A)
19           addNextState(state,n,s,s',{},c);
20         localExplored.add(hash(n,s));
21
22 proc addNextState(state,n,s,s',c0,c)
23   nextState.L = (state.L \ {(n,s)}) ∪ {(n,s')};
24   nextState.I = (state.I \ c0) ∪ c;
25   if (!explored.contains(hash(nextState)))
26     frontier.addLast(nextState);


Figure 5: Consequence Prediction Algorithm

In Line 8 of Figure 5 the algorithm checks whether the 
explored state satisfies the desired safety properties. The 
developer can use a simple language [22] that involves 
loops, existential and comparison operators, state vari- 
ables, and function invocations to specify the properties. 

Exploring Independent Chains. We can divide the actions in a
distributed system into event chains, where each chain starts
with an application or scheduler event and continues by
triggering network events. We call two chains independent if
no event of the first chain changes the state of a node
involved in the second chain. Consequence Prediction avoids
exploring the interleavings of independent chains. Therefore,
the test in Line 17 of Figure 5 makes the algorithm re-explore
the scheduler and application events of a node if and only if
the previous events changed the local state of the node. For
dependent chains, if a chain event changes the local state of
a node, Consequence Prediction therefore explores all other
active chains that have been initiated from this node.
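
The search in Figure 5 can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the Mace implementation: states are pairs of frozensets, the handler relations H_M and H_A are modelled as dictionaries (so each handler is deterministic), the `max_states` bound stands in for STOP_CRITERION, and Python sets store the states themselves rather than explicit hashes. The toy system at the end is hypothetical.

```python
from collections import deque


def add_next_state(state, n, s, s2, consumed, produced, explored, frontier):
    """Mirror of addNextState: replace (n, s) by (n, s2) and update the network."""
    local, inflight = state
    nxt = ((local - {(n, s)}) | {(n, s2)}, (inflight - consumed) | produced)
    if nxt not in explored:
        frontier.append(nxt)


def find_conseq(current_state, prop, msg_handlers, action_handlers, max_states=10_000):
    """Return the states found by the search that violate prop.

    A state is (L, I): L is a frozenset of (node, local_state) pairs and
    I is a frozenset of in-flight (destination, message) pairs.
    """
    explored, local_explored, errors = set(), set(), []
    frontier = deque([current_state])
    visited = 0
    while frontier and visited < max_states:  # stand-in for STOP_CRITERION
        state = frontier.popleft()
        visited += 1
        if not prop(state):
            errors.append(state)  # predicted inconsistency found
        explored.add(state)
        local, inflight = state
        for (n, s) in local:
            # Process all network handlers for messages addressed to node n.
            for (dest, m) in inflight:
                if dest == n and (s, m) in msg_handlers:
                    s2, out = msg_handlers[(s, m)]
                    add_next_state(state, n, s, s2, frozenset({(n, m)}), out,
                                   explored, frontier)
            # Process local actions only for fresh local states (Line 17).
            if (n, s) not in local_explored:
                for (s_from, _action), (s2, out) in action_handlers.items():
                    if s_from == s:
                        add_next_state(state, n, s, s2, frozenset(), out,
                                       explored, frontier)
                local_explored.add((n, s))
    return errors


# Hypothetical toy system: a local "tick" at node a sends "ping" to node b,
# and handling "ping" drives b into a state that violates the property.
msg_handlers = {("b_idle", "ping"): ("b_bad", frozenset())}
action_handlers = {("a_idle", "tick"): ("a_done", frozenset({("b", "ping")}))}
start = (frozenset({("a", "a_idle"), ("b", "b_idle")}), frozenset())
errors = find_conseq(start, lambda st: ("b", "b_bad") not in st[0],
                     msg_handlers, action_handlers)
```

The `local_explored` set is what prunes interleavings: once a node's local actions have been expanded for a given local state, they are not expanded again from other global states that contain the same local state.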
$N$ : node identifiers
$S$ : node states
$M$ : message contents
$N \times M$ : (destination process, message)-pair
$C = 2^{N \times M}$ : set of messages with destination
$A$ : local node actions (timers, application calls)

system state : $(L, I) \in G$, $G = 2^{N \times S} \times 2^{N \times M}$
local node states : $L \subseteq N \times S$ (function from $N$ to $S$)
in-flight messages (network) : $I \subseteq N \times M$

behavior functions for each node :
  message handler : $H_M \subseteq (S \times M) \times (S \times C)$
  internal action handler : $H_A \subseteq (S \times A) \times (S \times C)$

transition function for distributed system :

node message handler execution :
  premise: $((s_1, m), (s_2, c)) \in H_M$
  before: $(L_0 \uplus \{(n, s_1)\},\ I_0 \uplus \{(n, m)\}) \leadsto$
  after: $(L_0 \uplus \{(n, s_2)\},\ I_0 \cup c)$

internal node action (timer, application calls) :
  premise: $((s_1, a), (s_2, c)) \in H_A$
  before: $(L_0 \uplus \{(n, s_1)\},\ I) \leadsto$
  after: $(L_0 \uplus \{(n, s_2)\},\ I \cup c)$


Figure 6: A Simple Model of a Distributed System


Note that hash(n, s) in Figure 5 implies that we have 
separate tables corresponding to each node for keeping 
hashed local states. If a state variable is not necessary
to distinguish two separate states, the user can annotate
the state variable to exclude it from the hash function,
improving the performance of Consequence Prediction.
Instead of holding all encountered
hashes, the hash table could be designed as a bounded 
cache to fit into the L2 cache or main memory, favor- 
ing access speed while admitting the possibility of re- 
exploring previously seen states. 
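
Both ideas in the paragraph above can be sketched briefly. This is an illustration under assumptions, not the Mace data structure: `local_state_hash` skips variables the user has annotated as irrelevant, and `BoundedSeenCache` is a fixed-capacity table that evicts its oldest entry (the eviction policy is our assumption; the text only requires that the table be bounded). The variable name `probe_count` is hypothetical.

```python
from collections import OrderedDict


def local_state_hash(node, variables, ignored=frozenset()):
    """Hash a node's local state, skipping variables annotated as irrelevant."""
    kept = tuple(sorted((k, v) for k, v in variables.items() if k not in ignored))
    return hash((node, kept))


class BoundedSeenCache:
    """Fixed-capacity set of hashes; evicts the oldest entry when full."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._entries = OrderedDict()

    def add(self, h):
        self._entries[h] = True
        self._entries.move_to_end(h)
        if len(self._entries) > self.capacity:
            self._entries.popitem(last=False)  # forget the oldest hash

    def __contains__(self, h):
        return h in self._entries


# Two local states that differ only in an ignored variable hash identically.
a = local_state_hash("n1", {"parent": "n0", "probe_count": 3}, ignored={"probe_count"})
b = local_state_hash("n1", {"parent": "n0", "probe_count": 9}, ignored={"probe_count"})

cache = BoundedSeenCache(capacity=2)
for h in (1, 2, 3):
    cache.add(h)
```

A forgotten hash only causes the corresponding state to be explored again, so bounding the table trades some repeated work for a smaller, faster table.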

Although simple, the idea of removing from the search 
actions of nodes with previously seen states eliminates 
many (uninteresting) interleavings from search and has 
a profound impact on the search depth that the model 
checker can reach with a limited time budget. This 
change was therefore key to enabling the use of the 
model checker at runtime. Knowing that consequence 
prediction avoids considering certain states, the question 
remains whether the remaining states are sufficient to 
make the search useful. Ultimately, the answer to this 
question comes from our evaluation (Section 4). 


2.3 Execution Steering 


CrystalBall’s execution steering mode enables the sys- 
tem to avoid entering an erroneous state by steering its 
execution path away from predicted inconsistencies. If a 
protocol was designed with execution steering in mind, 
the runtime system could report a predicted inconsis- 
tency as a special programming language exception, and 
allow the service to react to the problem using a service- 
specific policy. However, to measure the impact on exist- 
ing implementations, this paper focuses on generic run- 
time mechanisms that do not require the developer to in- 
sert exception-handling code. 

Event Filters. Recall that a node in our framework op- 
erates as a state machine and processes messages, timer 
events, and application calls via handlers. Upon noticing 
that running a certain handler can lead to an erroneous 
state, CrystalBall installs an event filter, which temporar- 
ily blocks the invocation of the state machine handler for 
messages from the relevant sender. 

The rationale is that a distributed system often con- 
tains a large amount of non-determinism that allows it 
to proceed even if certain transitions are disabled. For 
example, if the offending message is a Join request in 
a random tree, ignoring the message can prevent violat- 
ing a local state property. The joining nodes can later 
retry the procedure with an alternative potential parent 
and successfully join the tree. Similarly, if handling a 
message causes an equivalent of a race condition man- 
ifested as an inconsistency, delaying message handling 
allows the system to proceed to the point where handling 
the message becomes safe again. Note that state machine 
handlers are atomic, so CrystalBall is unlikely to inter- 
fere with any existing recovery code. 
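
An event filter of this kind can be sketched as a check in front of handler dispatch. The class and method names below are hypothetical and the sketch simply reports that delivery was blocked, whereas CrystalBall signals the sender at the transport level.

```python
class FilteringRuntime:
    """Dispatches messages to state machine handlers unless a filter blocks them."""

    def __init__(self, handlers):
        self.handlers = handlers      # message type -> handler(sender, payload)
        self.filters = set()          # blocked (sender, message type) pairs

    def install_filter(self, sender, msg_type):
        self.filters.add((sender, msg_type))

    def clear_filters(self):
        self.filters.clear()

    def deliver(self, sender, msg_type, payload=None):
        if (sender, msg_type) in self.filters:
            return False              # handler is not invoked
        self.handlers[msg_type](sender, payload)
        return True


children = []
runtime = FilteringRuntime({"Join": lambda sender, payload: children.append(sender)})
runtime.install_filter("n2", "Join")
blocked = runtime.deliver("n2", "Join")    # filtered: n2 must retry
accepted = runtime.deliver("n3", "Join")   # other senders are unaffected
runtime.clear_filters()
retried = runtime.deliver("n2", "Join")    # the retry succeeds once the filter is lifted
```

Because each handler runs atomically, blocking a handler before it starts leaves the node's state untouched, which is why the joining node can safely retry.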

Point of Intervention. In general, execution steering 
can intervene at several points in the execution path. Our 
current policy is to steer the execution as early as
possible. For example, if the erroneous execution path
involves a node issuing a Join request after resetting, the
system’s first interaction with that node occurs at the
node which receives its Join request. If this node
discovers the erroneous path, it can install the event filter.

Non-Disruptiveness of Execution Steering. Ideally, 
execution steering would always prevent inconsistencies 
from occurring, without introducing new inconsistencies 
due to a change in behavior. In general, however, guar- 
anteeing the absence of inconsistencies is as difficult as 
guaranteeing that the entire program is error-free. Crys- 
talBall therefore makes execution steering safe in prac- 
tice through two mechanisms: 


1. Sound Choice of Filters. It is important that
the chosen corrective action does not sacrifice the
soundness of the state machine. A sound filter is
one for which the sequence of events observed after
filtering is also a possible sequence of events
without filtering. The breaking of a TCP connection
is common in a distributed system using TCP.
Therefore, such distributed systems include
failure-handling code that deals with broken TCP
connections. This makes sending a TCP RST signal
a good candidate for a sound event filter, and it is
the filter we choose to use in CrystalBall. In the case
of communication over UDP, the filter simply drops
the UDP packet, which could similarly happen in
normal operation of the network.


2. Exploration of Corrected Executions. Before al- 
lowing the event filter to perform an execution steer- 
ing action, CrystalBall runs the consequence predic- 
tion algorithm to check the effect of the event filter 
action on the system. If the consequence prediction 
algorithm does not suggest that the filter actions are 
safe, CrystalBall does not attempt execution steer- 
ing and leaves the system to proceed as usual. 
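
The second mechanism amounts to a small decision rule, sketched below. All function names are hypothetical: `apply_event` and `apply_filter` produce the state after running the handler or the filter action, and `predicts_violation` stands in for a run of consequence prediction from that state.

```python
def decide(state, event, apply_event, apply_filter, predicts_violation):
    """Return "run" or "filter" for a pending event."""
    if not predicts_violation(apply_event(state, event)):
        return "run"       # no predicted inconsistency, nothing to avoid
    if predicts_violation(apply_filter(state, event)):
        return "run"       # filter not shown to be safe: proceed as usual
    return "filter"


# Toy model: the state is a counter, events add to it, filtering leaves it
# unchanged, and any value above 3 counts as a predicted violation.
def apply_event(state, event):
    return state + event


def apply_filter(state, event):
    return state


def predicts_violation(state):
    return state > 3


safe_event = decide(2, 1, apply_event, apply_filter, predicts_violation)
steered = decide(3, 1, apply_event, apply_filter, predicts_violation)
unhelpful = decide(5, 1, apply_event, apply_filter, predicts_violation)
```

The third case corresponds to steering being judged unhelpful: the filter is not installed when the filtered execution is itself predicted to be inconsistent.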


Rechecking Previously Discovered Violations. An 
event filter reflects possible future inconsistencies reach- 
able from the current state, and leaving an event filter in 
place indefinitely could deny service to some distributed 
system participants. CrystalBall therefore removes the 
filters from the runtime after every model checking run. 
However, it is useful to quickly check whether the previ- 
ously identified error path can still lead to an erroneous 
condition in a new model checking run. This is espe- 
cially important given the asynchronous nature of the 
model checker relative to the system messages, which 
can prevent the model checker from running long enough 
to rediscover the problem. To prevent this from happen- 
ing, the first step executed by the model checker is to 
replay the previously discovered error paths. If the prob- 
lem reappears, CrystalBall immediately reinstalls the ap- 
propriate filter. 
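
Replaying a stored error path is cheap because it follows one event sequence instead of searching. A minimal sketch, assuming a hypothetical `step` function that returns the next state or `None` when the event is no longer enabled:

```python
def replay_violates(start_state, path, step, prop):
    """Replay a previously discovered error path from the new snapshot."""
    state = start_state
    for event in path:
        state = step(state, event)
        if state is None:
            return False   # the path is no longer executable
        if not prop(state):
            return True    # the old problem reappears: reinstall the filter
    return False


# Toy model: events add to a counter, negative counters are not reachable,
# and the safety property requires the counter to stay below 3.
def step(state, event):
    nxt = state + event
    return nxt if nxt >= 0 else None


def prop(state):
    return state < 3


still_bad = replay_violates(0, [1, 1, 1], step, prop)
disabled = replay_violates(0, [1, -5, 1], step, prop)
harmless = replay_violates(0, [1, 1], step, prop)
```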

Immediate Safety Check. CrystalBall also supports 
immediate safety check, a mechanism that avoids incon- 
sistencies that would be caused by executing the current 
handler. Such imminent inconsistencies can happen even 
in the presence of execution steering because 1) conse- 
quence prediction explores states given by only a subset 
of all distributed system nodes, and 2) the model checker 
runs asynchronously and may not always detect incon- 
sistencies in time. The immediate safety check specula- 
tively runs the handler, checks the consistency properties 
in the resulting state, and prevents actual handler execu- 
tion if the resulting state is inconsistent. 
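
The speculative run can be illustrated with a deep copy of the node state. This is a sketch under assumptions: the copy stands in for whatever isolation mechanism the runtime uses, the handler is assumed to be deterministic (it is run a second time on the real state), and outgoing messages are not modelled. The property is the RandTree requirement that children and sibling lists be disjoint.

```python
import copy


def immediate_safety_check(state, handler, message, prop):
    """Run handler on a copy first; run it for real only if prop still holds."""
    trial = copy.deepcopy(state)
    handler(trial, message)       # speculative run; outputs would be held back
    if not prop(trial):
        return False              # block the handler; the real state is untouched
    handler(state, message)
    return True


def disjoint(state):
    return not (set(state["children"]) & set(state["siblings"]))


def on_join(state, sender):
    state["children"].append(sender)


node = {"children": ["n2"], "siblings": ["n3"]}
blocked = immediate_safety_check(node, on_join, "n3", disjoint)
allowed = immediate_safety_check(node, on_join, "n4", disjoint)
```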

We have found that exclusively using immediate
safety check would not be sufficient for avoiding
inconsistencies. The advantages of installing event filters
are: i) performance benefits of avoiding the bug sooner,
e.g., reducing unnecessary message transmission, ii) faster
reaction to an error, which implies greater chance of
avoiding a “point of no return” after which error avoidance
is impossible, and iii) the node that is supposed to
ultimately avoid the inconsistency by immediate safety
check might not have all the checkpoints needed to
notice the violation; this can result in false negatives (as
shown in Figure 9).

Liveness Issues. Applying an event filter could affect the
liveness properties of a distributed system. In our
experience, due to a large amount of non-determinism
(e.g., the node is bootstrapped with a list
of multiple nodes it can join), the system usually finds 
a way to make progress. We focus on enforcing safety 
properties, and we believe that occasionally sacrificing 
liveness is a valid approach. According to a negative re- 
sult by Fischer, Lynch, and Paterson [12], it is impossible 
to have both in an asynchronous system anyway. (For ex- 
ample, the Paxos [26] protocol guarantees safety but not 
liveness.) 


2.4 Scope of Applicability 


CrystalBall does not aim to find all errors; it is rather 
designed to find and avoid important errors that can 
manifest in real runs of the system. Results in Sec- 
tion 4 demonstrate that CrystalBall works well in prac- 
tice. Nonetheless, we next discuss the limitations of our 
approach and characterize the scenarios in which we be- 
lieve CrystalBall to be effective. 

Up-to-Date Snapshots. For Consequence Prediction to 
produce results relevant for execution steering and imme- 
diate safety check, it needs to receive sufficiently many 
node checkpoints sufficiently often. (Thanks to snapshot 
consistency, this is not a problem for deep online debug- 
ging.) We expect the stale snapshots to be less of an issue 
with stable properties, e.g., those describing a deadlock 
condition [5]. Since the node’s own checkpoint might 
be stale (because of enforcing consistent neighborhood 
snapshots for checking multi-node properties), immedi- 
ate safety check is perhaps more applicable to node-local 
properties. 

Higher frequency of changes in state variables re- 
quires higher frequency of snapshot exchanges. High- 
frequency snapshot exchanges in principle lead to: 1) 
more frequent model checker restarts (given the difficulty 
in building incremental model checking algorithms), and 
2) high bandwidth consumption. Among the examples
for which our technique is appropriate are overlays in
which state changes are infrequent.

Consequence Prediction as a Heuristic. Consequence
Prediction is a heuristic that explores a subset of the
search space. This is an expected limitation of
explicit-state model checking approaches applied to concrete
implementations of large software systems. The key
question in these approaches is how to direct the search
towards the most interesting states. Consequence Prediction
uses information about the nature of the distributed system
to guide the search; the experimental results in Section 4
show that it works well in practice, but we expect that
further enhancements are possible.


3 Implementation Highlights 


We built CrystalBall on top of the Mace [21] framework. 
Mace allows distributed systems to be specified suc- 
cinctly and outputs high-performance C++ code. We im- 
plemented our consequence prediction within the Mace 
model checker, and run the model checker as a separate 
thread that communicates future inconsistencies to the 
runtime. Our current implementation of the immediate 
safety check executes the handler in a copy of the state 
machine’s virtual memory (using fork()), and holds the 
transmission of messages until the successful completion 
of the consistency check. Upon encountering an incon- 
sistency in the copy, the runtime does not execute the 
handler in the primary state machine. In case of appli- 
cations with high messaging/state change rates in which 
the performance of immediate safety check is critical, we 
could obtain a state checkpoint [41] before running the
handler and roll back to it if an inconsistency is
encountered. Another option would be to employ operating
system-level speculation [32].


4 Evaluation 


Our experimental evaluation addresses the following 
questions: 1) Is CrystalBall effective in finding bugs in 
live runs? 2) Can any of the bugs found by Crystal- 
Ball also be identified by the MaceMC model checker 
alone? 3) Is execution steering capable of avoiding in- 
consistencies in deployed distributed systems? 4) Are the 
CrystalBall-induced overheads within acceptable levels? 


4.1 Experimental Setup 


We conducted our live experiments using ModelNet [43]. 
ModelNet allows us to run live code in a cluster of 
machines, while application packets are subjected to 
packet delay, loss, and congestion typical of the Inter- 
net. Our cluster consists of 17 older machines with dual 
3.4 GHz Pentium-4 Xeons with hyper-threading, 8 ma- 
chines with dual 2.33 GHz dual-core Xeon 5140s, and 3 
machines with 2.83 GHz Xeon X3360s (for Paxos exper- 
iments). Older machines have 2 GB of RAM, while the 
newer ones have 4 GB and 8 GB. These machines run 
GNU/Linux 2.6.17. One 3.4 GHz Pentium-4 machine 
running FreeBSD 4.9 served as the ModelNet packet for- 
warder for these experiments. All machines are intercon- 
nected with a full-rate 1-Gbps Ethernet switch. 

We consider two deployment scenarios. For our large-scale
experiments with deep online debugging, we multiplex
100 logical end hosts running the distributed service
across the 20 Linux machines, with 2 participants 
running the model checker on 2 different machines. We 
run with 6 participants for small-scale debugging exper- 
iments, one per machine. 

We use a 5,000-node INET [6] topology that we fur- 
ther annotate with bandwidth capacities for each link. 
The INET topology preserves the power law distribution 
of node degrees in the Internet. We keep the latencies 
generated by the topology generator; the average net- 
work RTT is 130ms. We randomly assign participants 
to act as clients connected to one-degree stub nodes in 
the topology. We set transit-transit links to be 100 Mbps, 
while we set access links to 5 Mbps/1 Mbps inbound- 
/outbound bandwidth. To emulate the effects of cross 
traffic, we instruct ModelNet to drop packets at random 
with a probability chosen uniformly at random between 
[0.001,0.005] separately for each link. 
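
The per-link loss assignment described above can be sketched as follows. The function and link names are hypothetical and this is not the emulator's own configuration interface; the sketch only shows an independent uniform draw per link.

```python
import random


def assign_drop_probabilities(links, low=0.001, high=0.005, seed=None):
    """Draw an independent drop probability for each link, uniform in [low, high]."""
    rng = random.Random(seed)
    return {link: rng.uniform(low, high) for link in links}


links = [("r1", "r2"), ("r2", "r3"), ("r3", "c1")]
drop = assign_drop_probabilities(links, seed=0)
```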


4.2 Deep Online Debugging Experience 


We have used CrystalBall to find inconsistencies (vio- 
lations of safety properties) in two mature implemented 
protocols in Mace, namely an overlay tree (RandTree) 
and a distributed hash table (Chord [42]). These
implementations were not only manually debugged both 
in local- and wide-area settings, but were also model 
checked using MaceMC [22]. We have also used our 
tool to find inconsistencies in Bullet’, a file distribu- 
tion system that was originally implemented in MACE- 
DON [37], and then ported to Mace. We found 13 new 
subtle bugs in these three systems that caused violation 
of safety properties. 


System      Bugs found    LOC Mace / C++
RandTree         7           309 /  2000
Chord            3           254 /  2200
Bullet’          3          2870 / 19628


Table 1: Summary of inconsistencies found for each system 
using CrystalBall. LOC stands for lines of code and reflects 
both the MACE code size and the generated C++ code size. 
The low LOC counts for Mace service implementations are 
a result of Mace’s ability to express these services succinctly. 
This number does not include the line counts for libraries and 
low-level services that services use from the Mace framework. 





Table 1 summarizes the inconsistencies that CrystalBall
found in RandTree, Chord, and Bullet’. Typical elapsed
times (wall-clock time) until finding an inconsistency in
our runs have been from less than an hour up to a day. This
time allowed the system being debugged to go through
complex realistic scenarios.¹ CrystalBall identified
inconsistencies by running consequence prediction from the
current state of the system for up to several hundred
seconds. To demonstrate their depth and complexity, we
detail four out of 13 inconsistencies we found in the three
services we examined.


¹ During this time, the model checker ran concurrently with a
normally executing system. We therefore do not consider this time to be
wasted by the model checker before deployment; rather, it is the time
consumed by a running system.


4.2.1 Example RandTree Bugs Found 


We next discuss bugs we identified in the RandTree over- 
lay protocol presented in Section 1.2. We name bugs ac- 
cording to the consistency properties that they violate. 
Children and Siblings Disjoint. The first safety prop- 
erty we considered is that the children and sibling lists 
should be disjoint. CrystalBall identified the scenario 
from Figure 2 in Section 1.2 that violates this property. 
The problem can be corrected by removing the stale
information about children in the handler for the
UpdateSibling message. CrystalBall also identified
variations of this bug that require changes in other handlers.

Recovery Timer Should Always Run. An important
safety property for RandTree is that the recovery timer
should always be scheduled. This timer periodically
causes a node to send Probe messages to the peer-list
members with which it does not have a direct connection.
It is vital for the tree’s consistency to keep nodes
up-to-date about the global structure of the tree. The
property was written by the authors of [22], but they did
not report any violations of it. We believe that our
approach discovered the violation in part because our
experiments considered more complex join scenarios.

Scenario exhibiting inconsistency. CrystalBall found a
violation of the property in a state where node A joins
itself and changes its state to “joined” but does not
schedule any timers. Although this does not cause problems
immediately, the inconsistency happens when another
node B with a smaller identifier tries to join, at which
point A gives up the root position, selects B as the root,
and adds B to its peer list. At this point A has a non-empty
peer list but no running timer.

Possible correction. Keep the timer scheduled even 
when a node has an empty peer list. 


4.2.2 Example Chord Bug Found 


We next describe a violation of a consistency property 
in Chord [42], a distributed hash table that provides key- 
based routing functionality. Chord and other related dis- 
tributed hash tables form a backbone of a large number of 
proposed and deployed distributed systems [17, 35, 38]. 

Chord Topology. Each Chord node is assigned a Chord
id (effectively, a key). Nodes arrange themselves in an
overlay ring where each node keeps pointers to its
predecessor and successor. Even in the face of asynchronous
message delivery and node failures, Chord has to maintain
a ring in which the nodes are ordered according to
their ids, and each node has a set of “fingers” that enables
it to reach exponentially larger distances on the ring.

Figure 7: An inconsistency in a run of Chord. Node C has its
predecessor pointing to itself while its successor list includes
other nodes.

Joining the System. To join the Chord ring, a node A 
first identifies its potential predecessor by querying with 
its id. This request is routed to the appropriate node P, 
which in turn replies to A. Upon receiving the reply, 
A inserts itself between P and P’s successor, and sends 
the appropriate messages to its predecessor and succes- 
sor nodes to update their pointers. A “stabilize” timer 
periodically updates these pointers. 

Property: If Successor is Self, So Is Predecessor. If 
a predecessor of a node A equals A, then its successor 
must also be A (because then A is the only node in the 
ring). This is a safety property of Chord that had been 
extensively checked using MaceMC, presumably using 
both exhaustive search and random walks. 

Scenario exhibiting inconsistency: CrystalBall found
a state where node A has A as a predecessor but has
another node B as its successor. This violation happens
at depths that are beyond those reachable by exhaustive
search from the initial state. Figure 7 shows the scenario.
During live execution, several nodes join the ring and all
have a consistent view of the ring. Three nodes A, B,
and C are placed consecutively on the ring, i.e., A is the
predecessor of B and B is the predecessor of C. Then B
experiences a node reset, and other nodes that have
established a TCP connection with B receive a TCP RST. Upon
receiving this error, node A removes B from its internal
data structures. As a consequence, node A considers C
as its immediate successor.

Starting from this state, consequence prediction
detects the following scenario that leads to violation. C
experiences a node reset, losing all its state. C then tries
to rejoin the ring and sends a FindPred message to A.
Because nodes A and C did not have an established TCP
connection, A does not observe the reset of C. Node A
replies to C by a FindPredReply message that shows A’s
successor to be C. Upon receiving this message, node C
i) sets its predecessor to A; ii) stores the successor list
included in the message as its successor list; and iii) sends
an UpdatePred message to A’s successor which, in this
case, is C itself. After sending this message, C receives
a transport error from A and removes A from all of its
internal structures including the predecessor pointer. In
other words, C’s predecessor would be unset. Upon
receiving the (loopback) message to itself, C observes that
the predecessor is unset and then sets it to the sender of
the UpdatePred message, which is C. Consequently, C
has its predecessor pointing to itself while its successor
list includes other nodes.

Possible corrections. One possibility is for nodes to 
avoid sending UpdatePred messages to themselves (this 
appears to be a deliberate coding style in Mace Chord). 
If we wish to preserve such coding style, we can alterna- 
tively place a check after updating a node’s predecessor: 
if the successor list includes nodes in addition to itself, 
avoid assigning the predecessor pointer to itself. 
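
The scenario and the second correction can be replayed on a small model of node C. This is a simplified sketch, not the Mace Chord code: the class, its handler names, and the extra node "D" in the successor list are hypothetical, and the UpdatePred handler only covers the case of an unset predecessor.

```python
class ChordNode:
    def __init__(self, node_id, guard=False):
        self.id = node_id
        self.guard = guard            # True enables the second proposed correction
        self.predecessor = None
        self.successors = []

    def on_find_pred_reply(self, sender, successor_list):
        self.predecessor = sender
        self.successors = list(successor_list)
        return self.successors[0]     # destination of the UpdatePred message

    def on_transport_error(self, peer):
        if self.predecessor == peer:
            self.predecessor = None
        self.successors = [s for s in self.successors if s != peer]

    def on_update_pred(self, sender):
        if self.predecessor is not None:
            return
        has_others = any(s != self.id for s in self.successors)
        if self.guard and sender == self.id and has_others:
            return                    # refuse to become our own predecessor
        self.predecessor = sender

    def consistent(self):
        alone = all(s == self.id for s in self.successors)
        return self.predecessor != self.id or alone


def replay(guard):
    c = ChordNode("C", guard)
    dest = c.on_find_pred_reply("A", ["C", "D"])  # A still lists C as its successor
    c.on_transport_error("A")
    if dest == c.id:
        c.on_update_pred(c.id)        # loopback UpdatePred
    return c


buggy, fixed = replay(guard=False), replay(guard=True)
```

With the guard enabled, C leaves its predecessor unset after the loopback message, so a later stabilization step can fill it in with a real neighbor.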


4.2.3 Example Bullet’ Bug Found 


Next, we describe our experience of applying Crystal- 
Ball to the Bullet’ [23] file distribution system. The 
Bullet’ source sends the blocks of the file to a subset of 
nodes in the system; other nodes discover and retrieve 
these blocks by explicitly requesting them. Every node 
keeps a file map that describes blocks that it currently 
has. A node participates in the discovery protocol driven 
by RandTree, and peers with other nodes that have the 
most disjoint data to offer to it. These peering relation- 
ships form the overlay mesh. 

Bullet’ is more complex than RandTree, Chord (and
tree-based overlay multicast protocols) because of 1) the
need for senders to keep their receivers up-to-date with
file map information, 2) the block request logic at the
receiver, and 3) the finely-tuned mechanisms for achieving
high throughput under dynamic conditions. The starting
point for our exploration was property 1):

Sender’s File Map and Receiver’s View of It Should
Be Identical. Every sender keeps a “shadow” file map
for each receiver, recording which blocks it has not yet
told the receiver about. Similarly, a receiver
keeps a file map that describes the blocks available at
the sender. Senders use the shadow file map to compute
“diffs” on demand for receivers, containing information
about blocks that are “new” relative to the last diff.

Senders and receivers communicate over non- 
blocking TCP sockets that are under control of MaceTcp- 
Transport. This transport queues data on top of the TCP 
socket buffer, and refuses new data when its buffer is full. 
Scenario exhibiting inconsistency: In a live run lasting
less than three minutes, CrystalBall quickly identified
a mismatch between a sender’s file map and the
receiver’s view of it. The problem occurs when the diff
cannot be accepted by the underlying transport. The
code then clears the receiver’s shadow file map, which
means that the sender will never try again to inform the
receiver about the blocks contained in that diff.
Interestingly enough, this bug existed in the original
MACEDON implementation, but there was an attempt to fix
it by the UCSD researchers working on Mace. The
attempted fix consisted of retrying later on to send a diff
to the receiver. Unfortunately, since the programmer left
the code for clearing the shadow file map after a failed
send, all subsequent diff computations will miss the
affected blocks.


Possible corrections. Once the inconsistency is identi- 
fied, the fix for the bug is easy and involves not clearing 
the sender’s file map for the given receiver when a mes- 
sage cannot be queued in the underlying transport. The 
next successful enqueuing of the diff will then correctly 
include the block info. 
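
The bug and its fix can be reproduced on a small model of the sender. The classes below are hypothetical stand-ins, not the Bullet’ or Mace transport code: the transport refuses data when its queue is full, and the `clear_on_failure` flag selects between the buggy and the corrected behavior.

```python
class BoundedTransport:
    def __init__(self, capacity):
        self.capacity = capacity
        self.queue = []

    def enqueue(self, message):
        if len(self.queue) >= self.capacity:
            return False               # buffer full: refuse new data
        self.queue.append(message)
        return True

    def drain(self):
        delivered, self.queue = self.queue, []
        return delivered


class Sender:
    def __init__(self, transport, clear_on_failure):
        self.transport = transport
        self.clear_on_failure = clear_on_failure  # True reproduces the bug
        self.blocks = set()
        self.shadow = set()            # blocks the receiver has not been told about

    def add_block(self, block):
        self.blocks.add(block)
        self.shadow.add(block)

    def send_diff(self):
        sent = self.transport.enqueue(frozenset(self.shadow))
        if sent or self.clear_on_failure:
            self.shadow.clear()
        return sent


def receiver_view(clear_on_failure):
    transport = BoundedTransport(capacity=1)
    sender = Sender(transport, clear_on_failure)
    view = set()
    sender.add_block(1)
    sender.send_diff()                 # queued
    sender.add_block(2)
    sender.send_diff()                 # refused: the queue is still full
    for diff in transport.drain():
        view |= diff
    sender.add_block(3)
    sender.send_diff()                 # queued
    for diff in transport.drain():
        view |= diff
    return view, sender.blocks


buggy_view, buggy_blocks = receiver_view(clear_on_failure=True)
fixed_view, fixed_blocks = receiver_view(clear_on_failure=False)
```

In the buggy run the receiver never learns about block 2, even though later diffs are delivered successfully.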


4.3 Comparison with MaceMC 


To establish the baseline for model checking perfor- 
mance and effectiveness, we installed our safety prop- 
erties in the original version of MaceMC [22]. We then 
ran it for the three distributed services for which we
identified safety violations. After 17 hours, exhaustive
search had not identified any of the violations caught by
CrystalBall and had reached only shallow depths. Some of
the specific depths reached by the model checker are as
follows: 1) RandTree with 5 nodes: 12 levels, 2) RandTree
with 100 nodes: 1 level, 3) Chord with 5 nodes: 14 levels,
and 4) Chord with 100 nodes: 2 levels. This illustrates the
limitations of exhaustive search from the initial state.


In another experiment, we additionally employed the
random walk feature of MaceMC. Using this setup,
MaceMC identified some of the bugs found by CrystalBall,
but it still failed to identify 2 RandTree, 2 Chord, and
3 Bullet’ bugs found by CrystalBall. In Bullet’, MaceMC 
found no bugs despite the fact that the search lasted 32 
hours. Moreover, even for the bugs found, the long list of 
events that lead to a violation (on the order of hundreds) 
made it difficult for the programmer to identify the error 
(we spent five hours tracing one of the violations involv- 
ing 30 steps). Such a long event list is unsuitable for 
execution steering, because it describes a low probabil- 
ity way of reaching the final erroneous state. In contrast, 
CrystalBall identified violations that are close to live ex- 
ecutions and therefore more likely to occur in the imme- 
diate future. 


4.4 Execution Steering Experience 


We next evaluate the capability of CrystalBall as a run- 
time mechanism for steering execution away from previ- 
ously unknown bugs. 


4.4.1 RandTree Execution Steering 


To estimate the impact of execution steering on de- 
ployed systems, we instructed the CrystalBall controller 
to check for violations of RandTree safety properties (in- 
cluding the one described in Section 4.2.1). We ran a 
live churn scenario in which one participant (process in a 
cluster) per minute leaves and enters the system on aver- 
age, with 25 tree nodes mapped onto 25 physical cluster 
machines. Every node was configured to run the model 
checker. The experiment ran for 1.4 hours and resulted 
in the following data points, which suggest that in prac- 
tice the execution steering mechanism is not disruptive 
for the behavior of the system. 


When CrystalBall is not active, the system goes 
through a total of 121 states that contain inconsisten- 
cies. When only the immediate safety check but not the 
consequence prediction is active, the immediate safety 
check engages 325 times, a number that is higher be- 
cause blocking a problematic action causes further prob- 
lematic actions to appear and be blocked successfully. 
Finally, we consider the run in which both execution 
steering and the immediate safety check (as a fallback) 
are active. Execution steering detects a future inconsis- 
tency 480 times, with 65 times concluding that chang- 
ing the behavior is unhelpful and 415 times modifying 
the behavior of the system. The immediate safety check 
fallback engages 160 times. Through a combined action 
of execution steering and immediate safety check, Crys- 
talBall avoided all inconsistencies, so there were no un- 
caught violations (false negatives) in this experiment. 


To understand the impact of CrystalBall actions on the 
overall system behavior, we measured the time needed 
for nodes to join the tree. This allowed us to empirically 
address the concern that TCP reset and message block- 
ing actions can in principle cause violations of liveness 
properties (in this case extending the time nodes need to 
join the tree). Our measurements indicated average
node join times between 0.8 and 0.9 seconds across
different experiments, with variance exceeding any
difference between the runs with and without CrystalBall. In 
summary, CrystalBall changed system actions 415 times 
(2.77% of the total of 14956 actions executed), avoided 
all specified inconsistencies, and did not degrade system 
performance. 

Figure 8: Scenario that exposes a previously reported Paxos 
violation of a safety property (two different values are chosen 
in the same instance). 


4.4.2 Paxos Execution Steering 


Paxos [26] is a well known fault-tolerant protocol for 
achieving consensus in distributed systems. Recently, 
it has been successfully integrated in a number of de- 
ployed [4, 28] and proposed [19] distributed systems. In 
this section, we show how execution steering can be ap- 
plied to Paxos to steer away from realistic bugs that have 
occurred in previously deployed systems [4, 28]. The 
Paxos protocol includes five steps: 


1. A leader tries to take the leadership position by 
sending Prepare messages to acceptors, and it in- 
cludes a unique round number in the message. 


2. Upon receiving a Prepare message, each acceptor 
consults the last promised round number. If the 
message’s round number is greater than that num- 
ber, the acceptor responds with a Promise message 
that contains the last accepted value if there is any. 


3. Once the leader receives a Promise message from 
the majority of acceptors, it broadcasts an Accept 
request to all acceptors. This message contains 
the value of the Promise message with the highest 
round number, or is any value if the responses re- 
ported no proposals. 


4. Upon the receipt of the Accept request, each accep- 
tor accepts it by broadcasting a Learn message con- 
taining the Accepted value to the learners, unless it 
had made a promise to another leader in the mean- 
while. 


5. By receiving Learn messages from the majority of 
the nodes, a learner considers the reported value as 
chosen. 
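
The five steps above can be sketched as follows. This is a minimal single-process illustration in Python, not the Mace implementation used in the experiments. The names `Promise`, `Acceptor`, `choose_accept_value`, and `learner_chosen` are hypothetical, messages are modeled as plain return values, and the learner is assumed to receive at most one Learn value per node.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Promise:
    """Reply to a Prepare: the last accepted (round, value), if any."""
    accepted_round: Optional[int]
    accepted_value: Optional[int]


class Acceptor:
    def __init__(self) -> None:
        self.promised_round = -1
        self.accepted_round: Optional[int] = None
        self.accepted_value: Optional[int] = None

    def on_prepare(self, round_no: int) -> Optional[Promise]:
        # Step 2: promise only if the round is newer than any promised so far.
        if round_no > self.promised_round:
            self.promised_round = round_no
            return Promise(self.accepted_round, self.accepted_value)
        return None

    def on_accept(self, round_no: int, value: int) -> Optional[int]:
        # Step 4: accept unless a promise was made to a later round meanwhile.
        if round_no >= self.promised_round:
            self.promised_round = round_no
            self.accepted_round = round_no
            self.accepted_value = value
            return value  # stands in for the Learn message
        return None


def choose_accept_value(promises: List[Promise], own_value: int) -> int:
    # Step 3: use the value from the Promise with the highest round number,
    # or the leader's own value if no acceptor reported a proposal.
    reported = [p for p in promises if p.accepted_round is not None]
    if not reported:
        return own_value
    return max(reported, key=lambda p: p.accepted_round).accepted_value


def learner_chosen(learn_values: List[int], n_nodes: int) -> Optional[int]:
    # Step 5: a value is chosen once a majority of nodes report it
    # (assumes one entry per reporting node).
    if not learn_values:
        return None
    value, count = Counter(learn_values).most_common(1)[0]
    return value if count > n_nodes // 2 else None
```

With three acceptors, a first round that gets value 0 accepted on two of them forces a second-round leader to re-propose 0 rather than its own value. That rule in `choose_accept_value` is the one the first injected bug violates.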


Figure 9: In 200 runs that expose Paxos safety violations due
to two injected errors, CrystalBall successfully avoided the
inconsistencies in all but 1 and 4 cases, respectively.


The implementation we used was a baseline Mace
Paxos implementation that includes a minimal set of features.
In general, a physical node can implement one or
more of the roles (leader, acceptor, learner) in the Paxos 
algorithm; each node plays all the roles in our experi- 
ments. The safety property we installed is the original 
Paxos safety property: at most one value can be chosen, 
across all nodes. The first bug we injected [28] is related 
to an implementation error in step 3, and we refer to it 
as bug1: once the leader receives the Promise message
from the majority of nodes, it creates the Accept request
by using the submitted value from the last Promise message
instead of the Promise message with the highest round
number. Because the rate at which the violation (due to
the injected error) occurs was low, we had to schedule 
some events to lead the live run toward the violation in 
a repeatable way. The setup we use comprises 3 nodes 
and two rounds, without any artificial packet delays. As 
illustrated in Figure 8, in the first round the communication
between node C and the other nodes is broken.
Also, a Learn packet is dropped from A to B. At the end 
of this round, A chooses the value proposed by itself (0). 
In the second round, the communication between A and 
other nodes is broken. At the end of this round, the value 
proposed by B (1) is chosen by B itself. 

The second bug we injected (inspired by [4]) involves 
keeping a promise made by an Acceptor, even after 
crashes and reboots. As pointed out in [4], it is often difficult
to implement this aspect correctly, especially under
various hardware failures. Hence, we inject an error in 
the way a promise is kept by not writing it to disk (we 
refer to it as bug2). To expose this bug we use a scenario 
similar to the one used for bug1, with the addition of a
reset of node B. 

To stress test CrystalBall’s ability to avoid inconsis- 
tencies at runtime, we repeat the live scenarios in the 
cluster 200 times (100 times for each bug) while varying
the time between rounds uniformly at random between
0 and 20 seconds. As we can see in Figure 9,
CrystalBall’s execution steering is successful in avoiding
the inconsistency at runtime 74% and 89% of the
time for bug1 and bug2, respectively. In these cases,
CrystalBall starts model checking after node C reconnects
and receives checkpoints from other participants.
After running the model checker for 3.3 seconds, C successfully
predicts that the scenario in the second round
would result in a violation of the safety property, and it
then installs the event filter. The avoidance by execution
steering happens when C rejects the Propose message
sent by B. Execution steering is more effective for bug2
than for bug1, as the former involves resetting B. This
in turn leaves more time for the model checker to rediscover
the problem by: i) consequence prediction, or ii)
replaying a previously identified erroneous scenario. Immediate
safety check engages 25% and 7% of the time,
respectively (in cases when model checking did not have 
enough time to uncover the inconsistency), and prevents 
the inconsistency from occurring later, by dropping the 
Learn message from C at node B. CrystalBall could not
prevent the violation for only 1% and 4% of the runs, re- 
spectively. The cause for these false negatives was the 
incompleteness of the set of checkpoints. 


4.5 Performance Impact of CrystalBall 


Memory, CPU, and bandwidth consumption. Because
consequence prediction runs in a separate process
that is most likely mapped to a different CPU core on 
modern processors, we expect little impact on the ser- 
vice performance. In addition, since the model checker 
does not cache previously visited states (it only stores 
their hashes) the memory is unlikely to become a bottle- 
neck between the model-checking CPU core and the rest 
of the system. 
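
The memory argument rests on recording only a hash for each visited state. A minimal sketch of that idea follows, assuming states can be serialized to bytes and using a hypothetical `successors` callback. It is a generic depth-bounded breadth-first exploration, not the consequence prediction algorithm itself, and the 8-byte digest length is an arbitrary choice.

```python
import hashlib
from collections import deque
from typing import Callable, Iterable


def state_hash(state: bytes) -> bytes:
    # Keep a short fixed-size digest instead of the full state.
    return hashlib.sha256(state).digest()[:8]


def explore(initial: bytes,
            successors: Callable[[bytes], Iterable[bytes]],
            max_depth: int) -> int:
    """Depth-bounded BFS; returns the number of distinct states visited."""
    seen = {state_hash(initial)}          # hashes only, never full states
    frontier = deque([(initial, 0)])      # full states live only in the frontier
    while frontier:
        state, depth = frontier.popleft()
        if depth == max_depth:
            continue
        for nxt in successors(state):
            h = state_hash(nxt)
            if h not in seen:
                seen.add(h)
                frontier.append((nxt, depth + 1))
    return len(seen)
```

The trade-off in this sketch is that a hash collision would cause a distinct state to be skipped, so the exploration is no longer exhaustive even within the depth bound.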

One concern with state exploration such as model- 
checking is the memory consumption. Figure 10 shows 
the consequence prediction memory footprint as a func- 
tion of search depth for our RandTree experiments. As 
expected, the consumed memory increases exponentially 
with search depth. However, since CrystalBall’s effective
search depth is less than 7 or 8, the memory consumed
by the search tree is less than 1 MB and can thus
easily fit in the L2 or (more recently) L3 cache of
state-of-the-art processors. Having the entire search tree
in-cache reduces the access rate to main memory and im- 
proves performance. 

In the deep online debugging mode, the model checker 
was running for 950 seconds on average in the 100-node 
case, and 253 seconds in the 6-node case. When running 
in the execution steering mode (25 nodes), the model 
checker ran for an average of about 10 seconds. The 
checkpointing interval was 10 seconds. 

Figure 10: The memory consumed by consequence prediction
(RandTree, depths 7 to 8) fits in an L2 CPU cache.

Figure 11: CrystalBall slows down Bullet’ by less than 10%
for a 20 MB file download.


The average size of a RandTree node checkpoint is 
176 bytes, while a Chord checkpoint requires 1028 bytes. 
Average per-node bandwidth consumed by checkpoints 
for RandTree and Chord (100-nodes) was 803 bps and 
8224 bps, respectively. These figures show that over- 
heads introduced by CrystalBall are low. Hence, we did 
not need to enforce any bandwidth limits in these cases. 

Overhead from Checking Safety Properties. In practice
we did not find the overhead of checking safety properties
to be a problem because: i) the number of nodes in
a snapshot is small, ii) the most complex of our properties
have O(n^2) complexity, where n is the number of
nodes, and iii) the state variables fit into L2 cache.

Overall Impact. Finally, we demonstrate that having
CrystalBall monitor a bandwidth-intensive application 
featuring a non-negligible amount of state such as Bullet’ 
does not significantly impact the application’s perfor- 
mance. In this experiment, we instructed 49 Bullet’ in- 
stances to download a 20 MB file. Bullet’ is not a CPU 
intensive application, although computing the next block 
to request from a sender has to be done quickly. It 
is therefore interesting to note that in 34 cases during 
this experiment the Bullet’ code was competing with the
model checker for the Xeon CPU with hyper-threading. 
Figure 11 shows that in this case using CrystalBall re- 
duced performance by less than 5%. Compressed Bullet’ 
checkpoints were about 3 kB in size, and the bandwidth 
that was used for checkpoints was about 30 Kbps per 
node (3% of a node’s outbound bandwidth of 1 Mbps). 
The reduction in performance is therefore primarily due 
to the bandwidth consumed by checkpoints. 


5 Related Work 


Debugging distributed systems is a notoriously difficult 
and tedious process. Developers typically start by us- 
ing an ad-hoc logging technique, coupled with strenuous 
rounds of writing custom scripts to identify problems. 
Several categories of approaches have gone further than 
the naive method, and we explain them in more detail in 
the remainder of this section. 

Collecting and Analyzing Logs. Several approaches 
(Magpie [2], Pip [34]) have successfully used exten- 
sive logging and off-line analysis to identify performance 
problems and correctness issues in distributed systems. 
Unlike these approaches, CrystalBall works on deployed 
systems, and performs an online analysis of the system 
state.

Deterministic Replay with Predicate Checking. Fri- 
day [14] goes one step further than logging to en- 
able a gdb-like replay of distributed systems, including 
watch points and checking for global predicates. WiDS- 
checker [28] is a similar system that relies on a combi- 
nation of logging/checkpointing to replay recorded runs 
and check for user predicate violations. WiDS-checker 
can also work as a simulator. In contrast to replay- 
and simulation-based systems, CrystalBall explores ad- 
ditional states and can steer execution away from erro- 
neous states. 

Online Predicate Checking. Singh et al. [40] have 
advocated debugging by online checking of distributed 
system state. Their approach involves launching queries 
across the distributed system that is described and 
deployed using the OverLog/P2 [40] declarative language/runtime
combination. D3S [27] enables developers
to specify global predicates which are then automatically
checked in a deployed distributed system. By using
binary instrumentation, D3S can work with legacy sys- 
tems. Specialized checkers perform predicate-checking 
topology on snapshots of the nodes’ states. To make 
the snapshot collection scalable, the checker’s snapshot 
neighborhood can be manually configured by the devel- 
oper. This work has shown that it is feasible to collect 
snapshots at runtime and check them against a set of 
user-specified properties. CrystalBall advances the state- 
of-the-art in online debugging in two main directions: 
1) it employs an efficient algorithm for model checking
from a live state to search for bugs “deeper” and “wider”
than in the live run, and 2) it enables execution steering to
automatically prevent previously unidentified bugs from 
manifesting themselves in a deployed system. 

Model Checking. Model checking techniques for finite 
state systems [16, 20] have proved successful in anal- 
ysis of concurrent finite state systems, but require the 
developer to manually abstract the system into a finite- 
state model which is accepted as the input to the system. 
Early efforts on explicit-state model checking of C and 
C++ implementations [31, 30, 46] have primarily con- 
centrated on a single-node view of the system. 

MODIST [45] and MaceMC [22] represent the state- 
of-the-art in model checking distributed system imple- 
mentations. MODIST [45] is capable of model check- 
ing unmodified distributed systems; it orchestrates state 
space exploration across a cluster of machines. MaceMC 
runs state machines for multiple nodes within the same 
process, and can determine safety and liveness viola- 
tions spanning multiple nodes. MaceMC’s exhaustive 
state exploration algorithm limits in practice the search 
depth and the number of nodes that can be checked. In 
contrast, CrystalBall’s consequence prediction allows it 
to achieve significantly shorter running times for similar 
depths, thus enabling it to be deployed at runtime. In 
[22] the authors acknowledge the usefulness of prefix- 
based search, where the execution starts from a given 
supplied state. Our work addresses the question of ob- 
taining prefixes for prefix-based search: we propose to 
directly feed into the model checker states as they are 
encountered in live system execution. Using CrystalBall 
we found bugs in code that was previously debugged in 
MaceMC and that we were not able to reproduce using 
MaceMC’s search. In summary, CrystalBall differs from 
MODIST and MaceMC by being able to run state space 
exploration from live state. Further, CrystalBall supports 
execution steering that enables it to automatically pre- 
vent the system from entering an erroneous state. 


Cartesian abstraction [1] is a technique for over- 
approximating state space that treats different state com- 
ponents independently. The independence idea is also 
present in our consequence prediction, but, unlike over- 
approximating analyses, bugs identified by consequence 
search are guaranteed to be real with respect to the model 
explored. The idea of disabling certain transitions in 
state-space exploration appears in partial-order reduction 
(POR) [13, 15]. Our initial investigation suggests that a
POR algorithm takes considerably longer than the con- 
sequence prediction algorithm. The advantage of POR 
is its completeness, but completeness is of second-order 
importance in our case because no complete search can 
terminate in a reasonable amount of time for state spaces 
of distributed system implementations. 

Runtime Mechanisms. In the context of operating systems,
researchers have proposed mechanisms that safely
re-execute code in a changed environment to avoid er- 
rors [33]. Such mechanisms become difficult to deploy 
in the context of distributed systems. Distributed transac- 
tions are a possible alternative to execution steering, but 
involve several rounds of communication and are inap- 
plicable in environments such as wide-area networks. A 
more lightweight solution involves forming a FUSE [11] 
failure group among all nodes involved in a join process. 
Making such approaches feasible would require collect- 
ing snapshots of the system state, as in CrystalBall. Our 
execution steering approach reduces the amount of work 
for the developer because it does not require code mod- 
ifications. Moreover, our experimental results show an 
acceptable computation and communication overhead. 

In Vigilante [9] and Bouncer [8], end hosts cooper- 
ate to detect and inform each other about worms that 
exploit even previously unknown security holes. Hosts 
protect themselves by generating filters that block bad 
inputs. Relative to these systems, CrystalBall deals with 
distributed system properties, and predicts inconsisten- 
cies before they occur. 

Researchers have explored modifying actions of con- 
current programs to reduce data races [18] by inserting 
locks in an approach that does not run static
analysis at runtime. Approaches that modify the state of a
program at runtime include [10, 36]; these approaches 
enforce program invariants or memory consistency with- 
out computing consequences of changes to the state. 


6 Conclusions 


We presented a new approach for improving the relia- 
bility of distributed systems, where nodes predict and 
avoid inconsistencies before they occur, even if they have 
not manifested in any previous run. We believe that 
our approach is the first to give running distributed sys- 
tem nodes access to such information about their future. 
To make our approach feasible, we designed and im- 
plemented consequence prediction, an algorithm for se- 
lectively exploring future states of the system, and de- 
veloped a technique for obtaining consistent information 
about the neighborhood of distributed system nodes. Our 
experiments suggest that the resulting system, Crystal- 
Ball, is effective in finding bugs that are difficult to de- 
tect by other means, and can steer execution away from 
inconsistencies at runtime. 


Abstract


Replicated state machines are an important and widely- 
studied methodology for tolerating a wide range of 
faults. Unfortunately, while replicas should be dis- 
tributed geographically for maximum fault tolerance, 
current replicated state machine protocols tend to mag- 
nify the effects of high network latencies caused by ge- 
ographic distribution. In this paper, we examine how to 
use speculative execution at the clients of a replicated 
service to reduce the impact of network and protocol la- 
tency. We first give design principles for using client 
speculation with replicated services, such as generating 
early replies and prioritizing throughput over latency. We 
then describe a mechanism that allows speculative clients 
to make new requests through replica-resolved specula- 
tion and predicated writes. We implement a detailed case 
study that applies this approach to a standard Byzantine 
fault tolerant protocol (PBFT) for replicated NFS and 
counter services. Client speculation trades 18% of maximum
throughput to decrease the effective latency under
light workloads, letting us speed up run time on single-client
micro-benchmarks 1.08-19x when the client is
co-located with the primary. On a macro-benchmark, reduced
latency gives the client a speedup of up to 5x.


1 Introduction 


As more of society depends on services running on com- 
puters, tolerating faults in these services is increasingly 
important. Replicated state machines [34] provide a gen- 
eral methodology to tolerate a wide variety of faults, 
including hardware failures, software crashes, and ma- 
licious attacks. Numerous examples exist for how to 
build such replicated state machines, such as those based 
on agreement [8, 11, 22, 25] and those based on quo- 
rums [1, 11]. 

For replicated state machines to provide increased 
fault tolerance, the replicas should fail independently. 
Various aspects of failure independence can be achieved 
by using multiple computers, independently written software
[2, 33], and separate administrative domains. Geographic
distribution is one important way to achieve failure
independence when confronted with failures such as
power outages, natural disasters, and physical attacks. 

Unfortunately, distributing the replicas geographically 
increases the network latency between replicas, and 
many protocols for replicated state machines are highly 
sensitive to latency. In particular, protocols that toler- 
ate Byzantine faults must wait for multiple replicas to 
reply, so the effective latency of the service is limited 
by the latency of the slowest replica being waited for. 
Agreement-based protocols further magnify the effects 
of high network latency because they use multiple mes- 
sage rounds to reach agreement. Some implementations 
may also choose to delay requests and batch them to- 
gether to improve throughput. 

Our work uses speculative execution to allow clients 
of replicated services to be less sensitive to high laten- 
cies caused by network delays and protocol messages. 
We observe that faults are generally rare, and, in the ab- 
sence of faults, the response from even a single replica 
is an excellent predictor of the final, collective response 
from the replicated state machine. Based on this observa- 
tion, clients in our system can proceed after receiving the 
first response, thereby hiding considerable latency in the 
common case in which the first response is correct, es- 
pecially if at least one replica is located nearby. When 
responses are completely predictable, clients can even 
continue before they receive any response. 

To provide safety in the rare case in which the first 
response is incorrect, a client in our system may only 
continue executing speculatively until enough responses
are collected to confirm the prediction. By tracking all 
effects of the speculative execution and not externaliz- 
ing speculative state, our system can undo the effects of 
the speculation if the first response is later shown to be 
incorrect. 

Because client speculation hides much of the la- 
tency of the replicated service from the client, replicated 
servers in our system are freed to optimize their behavior
to maximize their throughput and minimize load, such as 
by handling agreement in large batches. 

We show how client speculation can help clients of 
a replicated service tolerate network and protocol la- 
tency by adding speculation to the Practical Byzantine 
Fault Tolerance (PBFT) protocol [8]. We demonstrate 
how performance improves for a counter service and an 
NFSv2 service on PBFT from decreased effective latency 
and increased concurrency in light workloads. Specula- 
tion improves the client throughput of the counter service 
2-58x across two different network topologies. Speculation
speeds up the run time of NFS micro-benchmarks
1.08-19x and up to 5x on a macro-benchmark when 
co-locating a replica with the client. When replicas are 
equidistant from each other, our benchmarks speed up by 
1.06-6x and 2.2x, respectively. The decrease in latency
that client speculation provides does have a cost: under 
heavy workloads, maximum throughput is decreased by 
18%. 

We next describe our general approach to adding client 
speculation to a system with a replicated service. 


2 Client speculation in replicated services 


2.1 Speculative execution 


Speculative execution is a general latency-hiding tech- 
nique. Rather than wait for the result of a slow operation, 
a computer system may instead predict the outcome of 
that operation, checkpoint its state, and speculatively ex- 
ecute further operations using the predicted result. If the 
speculation is correct, the checkpoint is committed and 
discarded. If the speculation is incorrect, it is aborted, 
and the system rolls back its state to the checkpoint and 
re-executes further operations using the correct result. 
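
The checkpoint, speculate, and commit-or-roll-back cycle described above can be sketched as follows. This is a minimal illustration under stated assumptions: the state is a plain dictionary, the function name `speculate` and its parameters are hypothetical, and the slow operation is invoked sequentially after the speculative work, whereas a real system would overlap the two.

```python
import copy
from typing import Any, Callable, Dict

State = Dict[str, Any]


def speculate(state: State,
              predicted: Any,
              slow_operation: Callable[[], Any],
              continuation: Callable[[State, Any], None]) -> bool:
    """Run `continuation` on a predicted result; roll back if it was wrong.

    Returns True if the speculation committed, False if it was aborted.
    """
    checkpoint = copy.deepcopy(state)   # checkpoint before speculating
    continuation(state, predicted)      # speculative work on the prediction
    actual = slow_operation()           # sequential stand-in for the slow op
    if actual == predicted:
        return True                     # commit: the checkpoint is discarded
    state.clear()
    state.update(checkpoint)            # abort: roll back to the checkpoint
    continuation(state, actual)         # re-execute with the correct result
    return False
```

In this sketch a misprediction costs one checkpoint, one wasted run of `continuation`, and one re-execution, which is why the text stresses that the outcome must be predictable for speculation to pay off.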

In general, speculative execution is beneficial only if 
the time to checkpoint state is less than the time to per- 
form the operation that generates the result. Further, the 
outcome of that operation must be predictable. Incorrect 
speculations waste resources since all work that depends 
on a mispredicted result is thrown away. This waste low- 
ers throughput, especially when multiple entities are par- 
ticipating in a distributed system, since the system might 
have been able to service other entities in lieu of doing 
work for the incorrect speculation. Thus, the decision of 
whether or not to speculate on the result of an operation 
often boils down to determining which operations will be 
slow and which slow operations have predictable results. 


2.2 Applicability to replicated services 


Replicated services are an excellent candidate for client- 
based speculative execution. Clients of replicated state 
machine protocols that tolerate Byzantine faults must 
wait for multiple replicas to reply. That may mean wait- 
ing for multiple rounds of messages to be exchanged 
among replicas in an agreement-based protocol. If replicas
are separated by geographic distances (as they should
be in order to achieve failure independence), network 
latency introduces substantial delay between the time a 
client starts an operation and the time the client receives 
the reply that commits the operation. Thus, there is sub- 
stantial time available to benefit from speculative execu- 
tion, especially if one replica is located near the client. 

Replicated services also provide an excellent predictor 
of an operation’s result. Under the assumption that faults 
are rare, a client’s request will generate identical replies 
from every replica, so the first reply that a client receives 
is an excellent predictor of the final, collective reply from 
the replicated state machine (which we refer to as the 
consensus reply). After receiving the first reply to any 
operation, a client can speculate based on 1 reply with
high confidence. For example, when an NFS client tries 
to read an uncached file, it cannot predict what data will 
be returned, so it must wait for the first reply before it 
can continue with reasonable data. 

The results of some remote operations can be pre- 
dicted even before receiving any replies; for instance, an 
NFS client can predict with high likelihood of success 
that file system updates will succeed and that read oper- 
ations will return the same (possibly stale) values in its 
cache [28]. For such operations, a client may speculate 
based on 0 replies since it can predict the result of a re- 
mote operation with high probability. 
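
The two modes of speculation, on 1 reply and on 0 replies, can be sketched as a small client-side tracker. This is an illustration only: the class `ReplyTracker` and the `quorum` parameter are hypothetical, and the rule that `quorum` matching replies form the consensus reply is an assumption (for a PBFT client this would typically be f+1 matching replies).

```python
from collections import Counter
from typing import Hashable, Optional


class ReplyTracker:
    """Tracks replies to one request and the fate of its speculation."""

    def __init__(self, quorum: int, predicted: Optional[Hashable] = None):
        self.quorum = quorum
        self.votes: Counter = Counter()
        self.replied: set = set()
        # 0-reply speculation starts with a prediction already in hand.
        self.prediction = predicted
        self.outcome: Optional[str] = None   # "commit" or "abort"

    def on_reply(self, replica_id: int, reply: Hashable) -> Optional[str]:
        if self.outcome is not None or replica_id in self.replied:
            return self.outcome              # ignore late or duplicate replies
        self.replied.add(replica_id)
        if self.prediction is None:
            self.prediction = reply          # 1-reply speculation begins here
        self.votes[reply] += 1
        if self.votes[reply] >= self.quorum:
            # `reply` is the consensus reply; compare it with the prediction.
            self.outcome = "commit" if reply == self.prediction else "abort"
        return self.outcome
```

The client would begin speculative execution as soon as `prediction` is set and would commit or roll back when `on_reply` first returns a non-`None` outcome.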


2.3 Protocol adjustments 


Based on the above discussion, it becomes clear that 
some replicated state machine protocols will benefit 
more from speculative execution than others. For this 
reason, we propose several adjustments to protocols that 
increase the benefit of client-based speculation. 


2.3.1 Generate early replies 


Since the maximum latency that can be hidden by spec- 
ulative execution, in the absence of 0-reply speculation, 
is the time between when the client receives the first re- 
ply from any replica and when the client receives enough 
replies to determine the consensus response, a protocol 
should be designed to get the first reply to the client as 
quickly as possible. The fastest reply is realized when 
the client sends its request to the closest replica, and that 
replica responds immediately. Thus, a protocol that sup- 
ports client speculation should have one or more replicas 
immediately respond to a client with the replica’s best 
guess for the final outcome of the operation, as long as 
that guess can accurately predict the consensus reply. 
Assuming each replica stores the complete state of the 
service, the closest replica can always immediately per- 
form and respond to a read-only request. However, that 
reply is not guaranteed to be correct in the presence of 
concurrent write operations. It could be wrong if the
closest replica is behind in the serial order of operations 
and returns a stale value, or in quorum protocols where 
the replica state has diverged and is awaiting repair [1]. 
We describe optimizations in Section 3.2.2 that allow 
early responses from any replica in the system, along 
with techniques to minimize the likelihood of an incor- 
rect speculative read response. 

It is more difficult to allow any replica to immediately 
execute a modifying request in an agreement protocol. 
Backup replicas depend on the primary replica to de- 
cide a single ordering of requests. Without waiting for 
that ordering, a backup could guess at the order, speculatively
executing requests as it receives them. However,
it is unlikely that each replica will perceive the same
request ordering under workloads with concurrent writ- 
ers, especially with geographic distribution of replicas. 
Should the guessed order turn out wrong (beyond ac- 
ceptable levels [23]), the replica must roll back its state 
and re-execute operations in the committed order, hurt- 
ing throughput and likely causing its response to change. 

For agreement protocols like PBFT, a more elegant so- 
lution is to have only the primary execute the request 
early and respond to the client. As we explain in Sec- 
tion 3.3, such predictions are correct unless the primary 
is faulty. This solution enables us to avoid speculation or 
complex state management on the replicas that would re- 
duce throughput. Used in this way, the primary should be 
located near the most active clients in a system to reduce 
their latency. 


2.3.2 Prioritize throughput over latency 


There exist a myriad of replicated state machine proto- 
cols that offer varying trade-offs between throughput and 
latency [1, 8, 11, 22, 30, 32, 37]. Given client support for 
speculative execution, it is usually best to choose a pro- 
tocol that improves throughput over one that improves 
latency. The reason is that speculation can do much to 
hide replica latency but little to improve replica through- 
put. 

As discussed in the previous section, speculative ex- 
ecution can hide the latency that occurs between the re- 
ceipt of an early reply from a replica and the receipt of 
the reply that ends the operation. Thus, as long as a spec- 
ulative protocol provides for early replies from the clos- 
est or primary replica, reducing the latency of the overall 
operation does not ordinarily improve user-perceived la- 
tency. 

Speculation can only improve throughput in the case 
where replicas are occasionally idle by allowing clients 
to issue more operations concurrently. If the replicas are 
fully loaded, speculation may even decrease throughput 
because of the additional work caused by mispredictions 
or the generation of early replies. Thus, it seems prudent
to choose a protocol that has higher latency but
higher potential throughput, perhaps through batching, 
and stable performance under write contention [8, 22], 
rather than protocols that optimize latency over through- 
put [1, 11]. 

An important corollary of this observation is that client 
speculation allows one to choose simpler protocols. With 
speculation, a complex protocol that is highly optimized 
to reduce latency may perform approximately the same 
as a simpler, higher latency protocol from the viewpoint 
of a user. A simpler protocol has many benefits, such 
as allowing a simpler implementation that is quicker to 
develop, is less prone to bugs, and may be more secure 
because of a smaller trusted computing base. 


2.3.3 Avoid speculative state on replicas 


To ensure correctness, speculative execution must avoid 
output commits that externalize speculative output (e.g., 
by displaying it to a user) since such output cannot be
undone once externalized. The definition of what consti- 
tutes external output, however, can change. For instance, 
sending a network message to another computer would 
be considered an output commit if that computer did not 
support speculation. However, if that computer could be 
trusted to undo, if necessary, any changes that causally 
depend on the receipt of the message, then the message 
would not be an output commit. One can think of the 
latter case as enlarging the boundary of speculation from 
just a single computer to encompass both the sender and 
receiver. 

What should be the boundary of speculation for a 
replicated service? At least three options are possible: 
allow all replicas and clients of the service to share spec- 
ulative state, allow replicas to share speculative state with 
individual clients but not to propagate one client’s spec- 
ulative state to other clients, and disallow replicas from 
storing speculative state. 

Our design uses the third option, with the smallest 
boundary of speculation, for several reasons. First, the 
complexity of the system increases as more parts partic- 
ipate in a speculation. The system would need to use 
distributed commit and rollback [14] to involve replicas 
and other clients in the speculation, and the interaction 
between such a distributed commit and the normal repli- 
cated service commit would need to be examined care- 
fully. Second, as the boundary of speculation grows 
larger, the cost of a misprediction is higher; all repli- 
cas and clients that see speculative state must roll back 
all actions that depend on that state when a prediction is 
wrong. Finally, it may be difficult to precisely track de- 
pendencies as they propagate through the data structures 
of a replica, and any false dependencies in a replica’s 
state may force clients to trust each other in ways not re- 
quired by the data they share in the replicated service. 
For example, if the system takes the simple approach of
tainting the entire replica state, then one client’s mispre- 
diction would force the replica to roll back all later oper- 
ations, causing unrelated clients to also roll back. 


2.3.4 Use replica-resolved speculation 


Even with this small boundary of speculation, we would 
still like to allow clients to issue new requests that de- 
pend on speculative state (which we call speculative re- 
quests). Speculative requests allow a client to continue 
submitting requests when it would otherwise be forced to 
block. These additional requests can be handled concur- 
rently, increasing throughput when the replicas are not 
already fully saturated. 


One complication here is that, to maintain correctness, 
if one of the prior operations on which the client is spec- 
ulating fails, any dependent operations that the client is- 
sues must also abort. There is currently no mechanism 
for a replica to determine whether or not a client received 
a correct speculative response. Thus, the replica is un- 
able to detect whether or not to execute subsequent de- 
pendent speculative requests. 


To overcome this flaw, we propose replica-resolved 
speculation through predicated writes, in which replicas 
are given enough information to determine whether the 
speculations on which requests depend will commit or 
abort. With predicated writes, an operation that modifies 
state includes a list of the active speculations on which 
it depends, along with the predicted responses for those 
speculations. Replicas log each committed response they 
send to clients and compare each predicted response in 
a predicated write with the actual response sent. If all 
predicated responses match the saved versions, the spec- 
ulative request is consistent with the replica’s responses, 
and it can execute the new request. If the responses do 
not match, the replica knows that the client will abort 
this operation when rolling back a failed speculation, so 
it discards the operation. This approach assumes a pro- 
tocol in which all non-faulty replicas send the same re- 
sponse to a request. 


Note that few changes may need to be made to a protocol
to handle speculative requests that modify data. An
operation $O$ that depends on a prior speculation $O_i$, with
predicted response $r$, may simply be thought of as a single
deterministic request to the replicated service of the
predicated form: if response($O_i$) = $r$, then do $O$.
This predicate must be enforced on the replicas. How- 
ever, as shown in Section 5, predicate checking may be 
performed by a shim layer between the replication pro- 
tocol and the application without modifying the protocol 
itself.

Figure 1: PBFT-CS Protocol Communication. The early
response from the primary is shown with a dashed hollow
arrow, which replaces its response from the Reply phase
(dotted filled arrow) in PBFT.
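Returning to the predicated form of Section 2.3.4, the following is a minimal sketch of how a shim layer might enforce predicates on a replica. The names `PredicatedWrite`, `Shim`, `record`, and `execute` are hypothetical and chosen for illustration; they are not taken from any existing implementation, and responses are modeled as plain strings.

```python
from dataclasses import dataclass, field

NO_OP = "no-op"  # placeholder result for a quashed operation (assumption)


@dataclass
class PredicatedWrite:
    op: str  # the state-modifying operation O
    # Maps the id of each prior speculation O_i to its predicted response r.
    predicates: dict = field(default_factory=dict)


class Shim:
    """Hypothetical layer between the replication protocol and the service."""

    def __init__(self, apply):
        self.apply = apply    # upcall into the application
        self.committed = {}   # op id -> committed response sent to the client

    def record(self, op_id, response):
        # Log the committed response so later predicates can be checked.
        self.committed[op_id] = response

    def execute(self, request):
        # "if response(O_i) = r, then do O" for every listed predicate.
        satisfied = all(
            op_id in self.committed and self.committed[op_id] == predicted
            for op_id, predicted in request.predicates.items()
        )
        return self.apply(request.op) if satisfied else NO_OP
```

Because the check uses only the committed responses, every non-faulty replica that has logged the same responses reaches the same decision, which is the property the predicated form relies on.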


3 Client speculation for PBFT 


In this section, we apply our general strategy for support- 
ing client speculative execution in replicated services to 
the Practical Byzantine Fault Tolerance (PBFT) protocol. 
We call the new protocol we develop PBFT-CS (CS denotes
the additional support for client speculation).


3.1 PBFT overview 


PBFT is a Byzantine fault tolerant state machine repli- 
cation protocol that uses a primary replica to assign 
each client request a sequence number in the serial or- 
der of operations. The replicas run a three-phase agree- 
ment protocol to reach consensus on the ordering of each 
operation, after which they can execute the operation 
while ensuring consistent state at all non-faulty repli- 
cas. Optionally, the primary can choose and attach non- 
deterministic data to each request (for NFS, this contains 
the current time of day). 

PBFT requires 3f + 1 replicas to handle f concurrent 
faulty replicas, which is the theoretical minimum [5]. 
The protocol guarantees liveness and correctness with 
up to f failures, and runs a view change sub-protocol to 
move the primary to another replica in the case of a bad 
primary. 

The communication pattern for PBFT is shown in Fig- 
ure 1. The client normally receives a commit after five 
one-way message delays, although this may be short- 
ened to four delays by overlapping the commit and re- 
ply phases using a tentative execution optimization [8]. 
To reduce the overhead of the agreement protocol, the 
primary may collect a number of client requests into a 
batch and run agreement once on the ordering of opera- 
tions within this batch. 

In our modified protocol, PBFT-CS, the primary responds
immediately to client requests, as illustrated by
the dashed line in Figure 1. 


3.2 PBFT-CS base protocol 


In both PBFT and PBFT-CS, the client sends each request
to all replicas, which buffer the request for execution
after agreement. Unlike the PBFT agreement protocol,
the primary in PBFT-CS executes an operation immediately
upon receiving a request and sends the early
reply to the client as a speculative response. The primary 
then forms a pre-prepare message for the next batch of 
requests and continues execution of the agreement proto- 
col. Other replicas are unmodified and reply to the client 
request once the operation has committed. 

Since the primary determines the serial ordering of all 
requests, under normal circumstances the client will re- 
ceive at least f committed responses from the replicas 
matching the primary’s early response. This signifies that 
the speculation was correct because the request commit- 
ted with the same value as the speculative response. If 
the client receives f + 1 matching responses that differ 
from the primary’s response, the client rolls back the cur- 
rent speculation and resumes execution with the consen- 
sus response. 
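The client-side decision just described can be sketched as follows. This is an illustrative reading of the rule, not the implemented code: the function name `resolve` and the three outcome labels are assumptions, and `committed_replies` is assumed to hold the committed responses received so far from replicas other than the primary.

```python
from collections import Counter


def resolve(early_reply, committed_replies, f):
    """Decide the fate of a speculation on the primary's early reply.

    Returns ("commit", value), ("rollback", value), or ("pending", None).
    """
    counts = Counter(committed_replies)
    # f committed responses matching the early reply confirm the speculation.
    if counts[early_reply] >= f:
        return ("commit", early_reply)
    # f + 1 matching responses that differ from the early reply force a rollback
    # and execution resumes with the consensus value.
    for value, n in counts.items():
        if value != early_reply and n >= f + 1:
            return ("rollback", value)
    return ("pending", None)
```

With f = 1, a single committed response that matches the early reply is enough to commit, while two matching responses with a different value trigger a rollback.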


3.2.1 Predicated writes 


A PBFT-CS client can issue subsequent requests imme- 
diately after predicting a response to an earlier request, 
rather than waiting for the earlier request to commit. To 
enable this without requiring replicas themselves to speculate
and potentially roll back, PBFT-CS ensures that a
request that modifies state does not commit if it depends 
on the value of any incorrect speculative responses. To 
meet this requirement, clients must track and propagate 
the dependencies between requests. 

For example, consider a client that reads a value stored 
in a PBFT-CS database (op1), performs some computa- 
tion on the data, then writes the result of the computa- 
tion back to the database (op2). If the primary returns 
an incorrect speculative result for op1, the value to be 
written in op2 will also be incorrect. When op1 eventu- 
ally commits with a different value, the client will fail its 
speculation and resume operation with the correct value. 
Although the client cannot undo the send of op2, depen- 
dency tracking prevents op2 from writing its incorrect 
value to the database. 

Each PBFT-CS client maintains a log of the digests $d_T$
of each speculative response issued at logical timestamp
$T$. When an operation commits, its corresponding digest
is removed from the tail of the log. If an operation aborts, 
its digest is removed from the log, along with the digests 
of any dependent operations. 

Clients append any required dependencies to each
speculative request, of the form $\{c, (t_i, d_i), \ldots\}$ for client
$c$ and each digest $d_i$ at timestamp $t_i$.

Replicas also store a log of digests for each client with 
the committed response for each operation. The replica 
executes a speculative request only if all digests in the re- 
quest’s dependency list match the entries in the replica’s 
log. Otherwise, the replica executes a no-op in place of
the operation.
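As an illustration of the client-side bookkeeping described above, the sketch below keeps a log of speculative response digests keyed by logical timestamp and builds the dependency list attached to a request. SHA-256 is used only as a stand-in digest function, and the class and method names are hypothetical.

```python
import hashlib


def digest(response: bytes) -> str:
    # Stand-in digest; the actual digest function is an assumption here.
    return hashlib.sha256(response).hexdigest()


class ClientSpecLog:
    def __init__(self, client_id):
        self.client_id = client_id
        # timestamp -> (digest of speculative response, timestamps it depends on)
        self.entries = {}

    def speculate(self, t, response, deps=()):
        self.entries[t] = (digest(response), set(deps))

    def commit(self, t):
        # A committed operation is no longer a speculation.
        self.entries.pop(t, None)

    def abort(self, t):
        # Remove the failed speculation and everything that depends on it.
        self.entries.pop(t, None)
        dependents = [u for u, (_, deps) in self.entries.items() if t in deps]
        for u in dependents:
            self.abort(u)

    def dependency_list(self, deps):
        # Builds {c, (t_i, d_i), ...} for the still-active speculations in deps.
        pairs = [(t, self.entries[t][0]) for t in sorted(deps) if t in self.entries]
        return (self.client_id, pairs)
```

A replica would compare each `(t_i, d_i)` pair against its own per-client log of committed response digests and execute a no-op on any mismatch.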

It is infeasible for replicas to maintain an unbounded 
digest log for each client in a long-running system, so 
PBFT-CS truncates these logs periodically. Replicas
must make a deterministic decision on when to truncate 
their logs to ensure that non-faulty replicas either all ex- 
ecute the operation or all abort it. This is achieved by 
truncating the logs at fixed deterministic intervals. 

If a client issues a request containing a dependency 
that has since been discarded from the log, the repli- 
cas abort the operation, replacing it with a no-op. The 
client recognizes this scenario when receiving a consen- 
sus response that contains a special retry result. It retries 
execution once all its dependencies have committed. In 
practice an operation will not abort due to missing de- 
pendencies, provided that the log is sufficiently long to 
record all operations issued in the time between a replica 
executing an operation and a quorum of responses being 
received by the client. 
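One way to make truncation deterministic is to tie it to the agreed sequence number, as in the sketch below. The specific policy (at every multiple of `interval`, drop entries older than one interval) is an assumption made for illustration; the text only requires that truncation happen at fixed deterministic intervals. The outcome labels are likewise hypothetical.

```python
EXECUTE, NO_OP, RETRY = "execute", "no-op", "retry"


class ReplicaDigestLog:
    def __init__(self, interval):
        self.interval = interval
        # (client, timestamp) -> (sequence number, digest of committed response)
        self.entries = {}

    def record(self, seq, client, t, d):
        self.entries[(client, t)] = (seq, d)
        # Truncate only on sequence-number boundaries, so all non-faulty
        # replicas hold the same log when processing a given operation.
        if seq % self.interval == 0:
            cutoff = seq - self.interval
            self.entries = {k: v for k, v in self.entries.items() if v[0] > cutoff}

    def check(self, client, deps):
        for t, d in deps:
            entry = self.entries.get((client, t))
            if entry is None:
                return RETRY   # dependency truncated: no-op plus retry result
            if entry[1] != d:
                return NO_OP   # prediction was wrong: quash the operation
        return EXECUTE
```

In this sketch `RETRY` stands for the case in which the replica executes a no-op and the consensus response carries the special retry result.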


3.2.2 Read-only optimization 


Many state machine replication protocols provide a read- 
only optimization [1, 8, 11, 22] in which read requests 
can be handled by each replica without being run through 
the agreement protocol. This allows reads to complete in 
a single communication round, and it reduces the load on 
the primary. 

In the standard optimization, a client issues optimized 
read requests directly to each replica rather than to the 
primary. Replicas execute and reply to these requests 
without taking any steps towards agreement. A client 
can continue after receiving 2f + 1 matching replies.
Because optimized reads are not serialized through the 
agreement protocol, other clients can issue conflicting, 
concurrent writes that prevent the client from receiving 
enough matching replies. When this happens, the client 
retransmits the request through the agreement protocol. 
This optimization is beneficial to workloads that con- 
tain a substantial percentage of read-only operations and 
exhibit few conflicting, concurrent writes. Importantly, 
when a backup replica is located nearer a client than the 
primary, that replica’s reply will typically be received by 
the client before the primary’s. 
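The client's handling of an optimized read under the standard optimization can be sketched as below, assuming 3f + 1 replicas and hashable reply values. The function names are hypothetical. The second function is a small addition not described in the text: it lets a client notice early that 2f + 1 matching replies can no longer arrive, so it can retransmit through the agreement protocol without waiting further.

```python
from collections import Counter


def matching_value(replies, f):
    """Return the value backed by at least 2f + 1 replies, or None."""
    if not replies:
        return None
    value, count = Counter(replies).most_common(1)[0]
    return value if count >= 2 * f + 1 else None


def quorum_still_possible(replies, f):
    """True if outstanding replies could still complete a 2f + 1 match."""
    outstanding = (3 * f + 1) - len(replies)
    best = max(Counter(replies).values(), default=0)
    return best + outstanding >= 2 * f + 1
```

When `quorum_still_possible` returns False, a conflicting concurrent write has most likely split the replies, and the read falls back to full agreement.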

PBFT-CS cannot use this standard optimization without
modification. A problem arises when a client issues
a speculative request that depends on the predicted
response to an optimized read request. PBFT-CS requires
all non-faulty replicas to make a deterministic decision
when verifying the dependencies on an operation.
However, since optimized reads are not serialized by the 
agreement protocol, one non-faulty replica may see a 
conflicting write before responding to an optimized read, 
while another non-faulty replica sees the write after
responding to the read. These two non-faulty replicas will
thus respond to the optimized read with different values,
and they will make different decisions when they verify 
the dependencies on a later speculative request. A non- 
faulty replica that sent a response that matches the first 
speculative response received by the client will commit 
the write operation, while other non-faulty replicas will 
not. Hence, writes may not depend on uncommitted op- 
timized reads. This is enforced at each replica by not 
logging the response digest for such requests. 

We address this problem by allowing a PBFT-CS
client to resubmit optimized read requests through the 
full agreement protocol, forcing the replicas to agree on 
a common response. When write conflicts are low, the re- 
submitted read is likely to have the same reply as the ini- 
tial optimized read, so a speculative prediction is likely 
to still be correct. After performing this procedure, we 
can send any dependent write requests, as they no longer 
depend on an optimized request. 

There are three issues that must be considered for a 
read request to be submitted using this optimization. 


• The request cannot read uncommitted state.
• The client should not follow a read with a write.
• The reply should not be completely predictable.


The first issue is required for consistency. A client 
cannot optimize a read request for a piece of state before 
all its write requests for that state are committed. Other- 
wise, it risks reading stale data when a sufficient number 
of backup replicas have not yet seen the client’s previous 
writes. The data dependency tracking required to imple- 
ment this policy is also used to propagate speculations, so 
no extra information needs to be maintained. Reads that 
do depend on uncommitted data may still be submitted 
through the agreement protocol as with write requests. 
Should a client desire a simpler policy for ensuring cor- 
rectness, it can disable the read-only optimization while 
it has any uncommitted writes. 

Second, consider a client that reads a value, performs 
a computation, and then writes back a new value. If the 
read request is initially sent optimized, issuing the write 
will force the read to be resubmitted. The “optimization”
results in additional work. Clients that anticipate follow- 
ing a read by a write should decline to optimize the read. 

Finally, if a client can predict the outcome of the re- 
quest before receiving any replies (for instance, if it pre- 
dicts that a locally-cached value has not become stale), 
then it should submit the request through the normal 
agreement protocol. Since the client does not need to 
wait for any replies, it is not hurt by the extra latency of 
waiting for agreement. 
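The three considerations above amount to a simple client-side policy, sketched here. The function and parameter names are hypothetical; how a client determines each condition (for example, from its dependency tracking or from application hints) is left open.

```python
def choose_read_path(reads_uncommitted_state, write_will_follow, reply_predictable):
    """Pick how to submit a read: "optimized" or "agreement".

    A read is optimized only when none of the three considerations applies.
    """
    if reads_uncommitted_state:
        return "agreement"   # required for consistency
    if write_will_follow:
        return "agreement"   # avoids resubmitting the read later
    if reply_predictable:
        return "agreement"   # the client is not waiting on a reply anyway
    return "optimized"
```

Only the first condition affects correctness; the other two are performance heuristics, so a conservative client could treat them as always true.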


3.3 Handling failures


Speculation optimizes for reduced latency in the non-failure
case, but it is important to ensure that correctness
and liveness are maintained in the presence of faulty
replicas. Failed speculations also increase the latency
of a client’s request, forcing it to roll back after having 
waited for the consensus response, and hurt throughput 
by forcing outstanding requests to become no-ops. It is 
important for our protocol to handle faults correctly in a 
way that still tries to preserve performance. 

A speculation will fail on a client when the first re- 
ply it receives to a request does not match the consensus 
response. There are three cases in which this might hap- 
pen: 

• The most common case occurs when a write issued
  by another client conflicts with an optimized read.
  In an extreme instance, one replica’s early reply
  could contain the stale data while all other replicas
  reply with current data.

• The second case occurs when there is a view
  change. PBFT ensures that committed requests
  will be ordered the same in the new view, but
  the client is speculating on uncommitted requests
  that the new replica could order differently. View
  changes may be the result of a bad primary, or they
  may be triggered by network conditions or proactive
  recovery [9].

• The third case occurs when the primary is faulty,
  and it either returns an incorrect speculative
  response or serializes a request differently when
  running the agreement protocol. We next examine this
  scenario further.


It is trivial for a client to detect a faulty primary: a 
request’s early reply from the primary and the consensus 
reply will be in the same view and not match. If signed 
responses are used, the primary’s bad reply can be given 
to other replicas as a proof of misbehavior. However, if 
simple message authentication codes (MACs) are used, 
the early reply cannot be used in this way since MACs 
do not provide non-repudiation. 

The simplest solution to handling faults with MACs is 
for a client to stop speculating if the percentage of failed 
speculations it observes surpasses a threshold. PBFT-CS
currently uses an arbitrary threshold of 1%. If a
client observes that the percentage of failed speculations 
is greater than 1% over the past n early replies provided 
by a replica, it simply ceases to speculate on subsequent 
early replies from that replica. Although it will not spec- 
ulate on subsequent replies, it can still track their accu- 
racy and resume speculating on further replies if the per- 
centage falls below a threshold. Our experimental results 
verify that at this threshold, PBFT-CS is still effective at 
reducing the average latency under light workloads. 
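The threshold policy can be sketched as a sliding window kept per replica. The window length of 1000 is an assumed value for n, which the text leaves unspecified, and the class name is hypothetical.

```python
from collections import deque


class SpeculationGate:
    """Tracks the accuracy of the last n early replies from one replica."""

    def __init__(self, window=1000, threshold=0.01):
        # True means the early reply matched the consensus reply.
        self.outcomes = deque(maxlen=window)
        self.threshold = threshold

    def observe(self, matched):
        # Accuracy is tracked even while speculation is suspended,
        # so speculation can resume once the failure rate drops.
        self.outcomes.append(matched)

    def failure_rate(self):
        if not self.outcomes:
            return 0.0
        return self.outcomes.count(False) / len(self.outcomes)

    def may_speculate(self):
        # Stop speculating when the failure percentage exceeds the threshold.
        return self.failure_rate() <= self.threshold
```

A client would call `observe` once the consensus reply for a request is known, and consult `may_speculate` before acting on the next early reply from that replica.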


3.4 Correctness 


The speculative execution environment and PBFT protocol
used in our system both have well-established correctness
guarantees [7, 28]. We thus focus our attention
on the modifications made to PBFT, to ensure that this
protocol remains correct.
Our modified version of PBFT differs from the origi- 
nal in several key ways: 

• A client may be sent a speculative response that
  differs from the final consensus value.

• A client may submit an operation that depends on
  a failed speculation.

• The primary may execute an operation before it
  commits.

We evaluate each modification independently.


Incorrect speculation A bad primary may send an in- 
correct speculative response to a client, in that it differs 
on the value or ordering of the final consensus value. We 
also consider in this class an honest primary that sends a 
speculative response to a client but is unable to complete 
agreement on this response due to a view change. In ei- 
ther case, the client will only see the consensus response 
once the operation has undergone agreement at a quorum 
of replicas. If the speculative response was incorrect, it 
is safe for the client to roll back the speculative execu- 
tion and re-run using the consensus value, since PBFT 
ensures that all non-faulty replicas will agree on the con- 
sensus value. 


Dependent operations A further complication arises 
when the client has issued subsequent requests that de- 
pend on the value of a speculative response. Here, the 
speculation protocol on the client ensures that it rolls 
back execution of any operations that have dependencies 
on the failed speculation. We must ensure that all valid 
replicas make an identical decision to abort each depen- 
dent operation by replacing it with a no-op. 

Replicas maintain a log of the digests for each com- 
mitted operation and truncate this log at deterministic 
intervals so that all non-faulty replicas have the same 
log state when processing a given operation. Predicated 
writes in PBFT-CS allow the client to express the specu- 
lation dependencies to the replicas. A non-faulty replica 
will not execute any operation that contains a depen- 
dency that does not match the corresponding digest in 
the log, or that does not have a matching log entry. Since 
the predicated write contains the same information used 
by the client when rolling back dependent operations, the 
replicas are guaranteed to abort any operation aborted by 
the client. If a client submits a dependency that has since 
been truncated from the log, it will also be aborted. 

The only scenario where replicas are unable to de- 
terministically decide whether a speculative response 
matches its agreed-upon value is when a speculative re- 
sponse was produced using the read-only optimization. 
Here, different replicas may have responded with differ- 
ent values to the read request. We explicitly avoid this 
case by making it an error to send a write request that
depends on the reply to an optimized read request; correct
clients will never issue such a request. Replicas do not 
store the responses to optimized reads in their log and 
hence always ignore any request sent by a faulty client 
with a dependency on an optimized read. 


Speculative execution In our modified protocol, the 
primary executes client requests immediately upon their 
receipt, before the request has undergone agreement. The 
agreement protocol dictates that all non-faulty replicas 
commit operations in the order proposed by the primary, 
unless they execute a view change to elect a new primary.
After a view change, the new primary may reorder
some uncommitted operations executed by the previous
primary; however, the PBFT view change protocol ensures
that any committed operations persist into the new
view. It is safe for the old primary to restore its state to 
the most recent committed operation since any incorrect 
speculative response will be rolled back by clients where 
necessary. 


4 Discussion and future optimizations 


In this section, we further explore the protocol design 
space for the use of client speculation with PBFT. We 
compare and contrast possible protocol alternatives with 
the PBFT-CS protocol that we have implemented. 


4.1 Alternative failure handling strategies 


We considered two alternative strategies for dealing with 
faulty primaries. First, we could allow clients to request 
a view change without providing a proof of misbehav- 
ior. This scheme would seem to significantly compro- 
mise liveness in a system containing faulty clients since 
they can force view changes at will. However, this is an 
existing problem in BFT state machine replication in the 
absence of signatures. A bad client in PBFT is always 
able to force a view change by sending a request to the 
primary with a bad authenticator that appears correct to 
the primary or by sending different requests to different 
replicas [7]. We could mitigate the damage a given bad 
client can do by having replicas make a local decision to 
ignore all requests from a client that ‘framed’ them. In 
this way a bad client cannot initiate a view change after
incriminating f primaries. 

Alternatively, we could require signatures in commu- 
nications between client and replicas. This is the most 
straightforward solution, but entails significant CPU
overhead. Compared to these two alternative designs, 
we chose to have PBFT-CS revert to a non-speculative 
protocol due to the simplicity of the design and higher 
performance in the absence of a faulty primary. 


4.2 Coarse-grained dependency tracking 


PBFT-CS tracks and specifies the dependencies of a 
speculative request at fine granularity. Thus, message
size and state grow as the average number of dependencies
for a given operation increases. To keep message
size and state constant, we could use coarser-grained de- 
pendencies. 

We could track dependencies on a per-client basis by
ensuring that a replica executes a request from a client at
logical timestamp $T$ only if all outstanding requests from
that client prior to time $T$ have committed with the same
value the client predicted.

Instead of maintaining a list of dependencies, each 
client would instead store a hash chained over all consen- 
sus responses and subsequent speculative responses. The 
client would append this hash to each operation in place 
of the dependency list. The client would also keep an- 
other hash chained only over consensus responses, which 
it would use to restore its dependency state after rolling 
back a failed speculation. 

Each replica would maintain a hash chained over re- 
sponses sent to the client and would execute an opera- 
tion if the hash chain in the request matches its record of 
responses. Otherwise, it would execute a no-op. 
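The chained-hash alternative can be sketched as follows. SHA-256 and the empty initial chain value are assumptions, as are the names; the sketch only illustrates how the two client-side chains relate to the single chain a replica would keep.

```python
import hashlib


def extend(chain: bytes, response: bytes) -> bytes:
    # New chain value covers the previous chain and the next response.
    return hashlib.sha256(chain + response).digest()


class ClientChains:
    def __init__(self):
        self.committed = b""    # chained over consensus responses only
        self.speculative = b""  # consensus responses plus later speculative ones

    def on_speculative(self, response):
        self.speculative = extend(self.speculative, response)

    def on_consensus(self, response):
        self.committed = extend(self.committed, response)

    def rollback(self):
        # Restore dependency state after a failed speculation.
        self.speculative = self.committed
```

A replica would extend its own chain with each response it sends to the client and execute a request only if the hash carried by the request equals that chain, executing a no-op otherwise. A single mismatch changes every later chain value, which is why one failed speculation aborts all subsequent in-progress operations under this scheme.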

We chose not to use this optimization in PBFT-CS
since the use of chained hashes creates dependencies be- 
tween all operations issued by a client even when no 
causal dependencies exist. This increases the cost of a 
failed speculation since the failure of one speculative re- 
quest causes all subsequent in-progress speculative oper- 
ations to abort. Coarse-grained dependency tracking also 
limits the opportunities for running speculative read op- 
erations while there are active speculative writes. Since 
speculative read responses are not serialized with respect 
to write operations, it is likely that the client will insert 
the read response in the wrong point in the hash chain, 
causing subsequent operations to abort. 


4.3 Reads in the past 


A read-only request need not circumvent the agreement
protocol completely, as described in Section 3.2.2.
A client can instead take a hybrid approach for non- 
modifying requests: it can submit the request for full 
agreement and at the same time have the nearest replica 
immediately execute the request. 

If the primary happens to be the nearest to the client,
this is not a change from the normal protocol. When
another replica is closer, the client gets a lower-latency
first reply. In addition, running the read through agreement
eliminates the second consideration for optimized reads
(Section 3.2.2): that a client should not follow a read with a write.

However, this new optimization presents a problem 
when there are concurrent writes by multiple clients. A 
non-primary replica will execute an optimized request, 
and a client will speculate on its reply, in a sequential or- 
der that is likely different from the request’s actual order 
in the agreement protocol. In essence, the read has been
executed in the past, at a logical time when the replicas
have not yet processed all operations that are undergoing 
agreement but when they still share a consistent state. 

We could extend the PBFT-CS read-only optimization 
to also allow reads in the past. Under a typical configu- 
ration, there is only one round of agreement executing at 
any one time, with incoming requests buffered at the pri- 
mary to run in the next batch of agreement. If we were to 
ensure that all buffered reads are reordered, when possi- 
ble, to be serialized at the start of this next batch, it would 
be highly likely that no write will come between a read 
being received by a replica and the read being serialized 
after agreement. 

Note that the primary may assign any order to requests 
within a batch as long as no operation is placed before 
one on which it depends. Recall that a PBFT-CS client 
will only optimize a read if the read has no outstanding 
write dependencies. Hence, the primary is free to move 
all speculative reads to the start of the batch. The primary 
executes these requests on a snapshot of the state taken 
before the batch began. 
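The reordering step can be sketched as a stable partition of the batch. Requests are modeled here as `(kind, payload)` tuples, which is an assumption made for illustration; the reads are assumed to be optimized reads with no outstanding write dependencies, so moving them forward cannot place an operation before one on which it depends.

```python
def order_batch(requests):
    """Move optimized reads to the start of the batch, preserving relative order.

    requests: list of (kind, payload) tuples with kind "read" or "write".
    """
    reads = [r for r in requests if r[0] == "read"]
    others = [r for r in requests if r[0] != "read"]
    return reads + others
```

The reads at the front would then be executed against a snapshot of the state taken before the batch began, matching the state on which a nearby replica answered them.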


5 Implementation 


We modified Castro and Liskov’s PBFT library, libbyz [8],
to implement the PBFT-CS protocol described
in Section 3. We also modified BFS [8], a Byzantine- 
fault-tolerant replicated file service based on NFSv2, to 
support client speculation. The overall system can be 
divided into three parts as shown in Figure 2: the NFS 
client, a protocol relay, and the fault-tolerant service. 


5.1 NFS client operation


Our client system uses the NFSv2 client module of 
the Speculator kernel [28], which provides process-level 
support for speculative execution. Speculator supports 
fine-grained dependency tracking and checkpointing of 
individual objects such as files and processes inside the 
Linux kernel. Local file systems are speculation-aware 
and can be accessed without triggering an output com- 
mit. Speculator buffers external output to the termi- 
nal, network, and other devices until the speculations on 
which they depend commit. Speculator rolls back process
and OS state to checkpoints and restarts execution if
a speculation fails.

To execute a remote NFS operation, Speculator first
attaches a list of the process’s dependencies to the message,
then sends it to a relay process on the same machine.
The relay interprets this list and attaches the correct
predicates when sending the PBFT-CS request.

The relay brokers communication between the client 
and replicas. It appears to be a standard NFS server to 
the client, so the client need not deal with the PBFT-CS 
protocol. When the relay receives the first reply to a 1- 
reply speculation, the reply is logged and passed to the 
waiting NFS client. The NFS client recognizes specula- 
tive data, creates a new speculation, and waits for a con- 
firmation message from the relay. Once the consensus 
reply is known, the relay sends either a commit mes- 
sage or a rollback{reply} message containing the 
correct response. 

Our implementation speculates based on 0 replies for
GETATTR, SETATTR, WRITE, CREATE, and REMOVE calls.
It can speculate on 1 reply for GETATTR, LOOKUP, and
READ calls. This list includes the most common NFS
operations: we observed that at least 95% of all calls in
all our benchmarks were handled speculatively. Note that
we speculate on both 0 replies and 1 reply for GETATTR
calls. The kernel can speculate as soon as it has attributes
for a file. When the attributes are cached, 0 replies are
needed; otherwise, the kernel waits for 1 reply before
continuing.


5.2 PBFT-CS client operation


Speculation hides latency by allowing a single client to 
pipeline many requests; however, our PBFT implementation
only allows each PBFT-CS client to have a single
outstanding request at any time. We work around
this limitation by grouping up to 100 logical clients into 
a single client process. 

NFS with 0-reply speculation requires its requests to
be executed in the order they were issued. A PBFT-CS
client process can tag each request with a sequence num- 
ber so that the primary replica will only process requests 
from that client process’s logical clients in the correct or- 
der. Of course, two different clients’ requests can still be 
interleaved in any order by the primary. 
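A per-client-process reorder buffer is one way the primary could honor these sequence numbers. The sketch below is an assumption about how such ordering might be enforced, not a description of libbyz; sequence numbers are assumed to start at 0 and increase by one.

```python
class InOrderQueue:
    """Releases one client process's requests strictly in sequence order."""

    def __init__(self):
        self.next_seq = 0
        self.pending = {}  # sequence number -> request held back

    def submit(self, seq, request):
        # Buffer the request, then release every request that is now in order.
        self.pending[seq] = request
        ready = []
        while self.next_seq in self.pending:
            ready.append(self.pending.pop(self.next_seq))
            self.next_seq += 1
        return ready
```

The primary would keep one such queue per client process and remain free to interleave the released requests of different client processes in any order.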

To support this additional concurrency, we designed 
the client to use an event-driven API. User programs pass 
requests to libbyz and later receive two callbacks: one 
delivers the first reply and another delivers the consensus 
reply. The user program is responsible for monitoring 
libbyz’s communication channels and timers. 


5.3 Server operation


On the replicas, libbyz implements an event-based server
that performs upcalls into the service when needed: to
request non-deterministic data, to execute requests, and to
construct error replies. The library handles all communication
and state management, including checkpointing
and recovery.

Figure 3: Server throughput in a LAN, measured on the
shared counter service. PBFT-CS (4) is limited to four
concurrent requests.

Table 1: Major sources of overhead affecting throughput
for PBFT-CS relative to PBFT.

A shim layer is used to manage dependencies on repli- 
cas. When writes need to be quashed due to failed specu- 
lative dependencies, the shim layer issues a no-op to the 
service instead. Thus, the underlying service is not ex- 
posed to details of the PBFT-CS protocol. 

The primary will batch together all requests it receives 
while it is still agreeing on earlier requests. Batch- 
ing is a general optimization that reduces the number 
of protocol instances that must be run, decreasing the 
number of communications and authentication opera- 
tions [8, 22, 23, 37]. This implementation imposes a 
maximum batch size of 64 requests, a limit our bench- 
marks do run up against. 


6 Evaluation 


In this section, we quantify the performance of our 
PBFT-CS implementation. We have implemented a simple
shared counter micro-benchmark and several NFS
micro- and macro-benchmarks. 

We compare PBFT-CS against two other Byzantine
fault-tolerant agreement protocols: PBFT [8] and
Zyzzyva [22]. PBFT is the base protocol that we extend to
make use of client speculation. Its overall structure is
illustrated in Figure 1. We use the tentative reply
optimization, so each request must go through 4 communication
phases before the client acquires a reply that it can act on.
PBFT uses an adaptive batching protocol, allowing up to
64 requests to be handled in one agreement instance.

Figure 4: Time taken to run 2000 updates using the
shared counter service. The primary-local topology (a)
shows a client located at the same site as the primary.
The uniform topology (b) shows a remote client equidistant
from all sites. 0 ms (LAN) times for both graphs are
(in bar order): 0.36 s, 0.27 s, 0.41 s, 0.54 s, and 0.16 s.

Zyzzyva is a recent agreement protocol that is heavily
optimized for failure-free operation. When all replicas
are non-faulty (as in our experiments), it takes only 3
phases for a client to possess a consensus reply. We run
Kotla et al.’s implementation of Zyzzyva, which uses a
fixed batch size. We simulate an adaptive batching strat- 
egy by manually tuning the batch size as needed for best 
performance. 

By comparison, a PBFT-CS client can continue exe-
cuting speculatively after only 2 communication phases.
We expect this to significantly reduce the effective la- 
tency of our clients. Note that requests still require 4 
phases to commit, but we can handle those requests con- 
currently rather than sequentially. If we limit the number 
of in-flight requests to some number n, we call the pro- 
tocol “PBFT-CS (n).” 


6.1 Experimental setup 


Each replica machine uses a single Intel Xeon 2.8 GHz 
processor with 512 MB RAM (sufficient for our appli- 
cations). We always evaluate using four replicas without 
failures (unless noted). In our NFS comparisons, we use 
a single client that is identical in hardware to the replicas. 
Our counter service runs on an additional five client ma-
chines using Intel Pentium 4s or Xeons with clock speeds
of 3.06-3.20 GHz and 1 GB RAM. All systems use a
generic Red Hat Linux 2.4.21 kernel.

Our machines use gigabit Ethernet to communicate di- 
rectly with a single switch. Experiments using the shared 
counter service were performed on a Cisco Catalyst 2970
gigabit switch; NFS used an Intel Express ES101TX
10/100 switch.

Our target usage scenario is a system that consists of 
several sites joined by moderate latency connections (but 
slower than LAN speeds). Each site has a high-speed 
LAN hosting one replica and several clients, and clients 
may also be located off-site from any replica. For com- 
parison with other agreement protocols, we also consider 
using PBFT-CS in a LAN setting where all replicas and
clients are on the same local segment. 

Based on the above scenarios, we emulate a simpli- 
fied test network using NISTNet [6] that inserts an equal 
amount of one-way latency between each site. We let this 
inserted delay be either 2.5 ms or 15 ms. 

We also measure performance at clients located in dif- 
ferent areas in our scenario. In the primary-local topol- 
ogy, the client is at the same site as the current primary 
replica. The primary-remote topology considers a client 
at a different site hosting a backup replica. A client not
present at any site is shown in the uniform topology, and 
we let the client have the same one-way latency to all 
replicas as between sites. 

When comparing against a service with no replication 
in a given topology, we always assume that a client at a 
site can access its server using only the LAN. A client 
not at a site is still subject to added delay. 


6.2 Counter throughput 


We first examine the throughput of PBFT-CS using the 
counter service. Similar to Castro and Liskov’s standard 
0/0 benchmark [8], the counter’s request and reply size 
are minimal. This service exposes only one operation: 
increment the counter and return its new value. Each 
reply contains a token that the client must present on its 
next request. This does add a small amount of processing 
time to each request, but it ensures that client requests 
must be submitted sequentially. 
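
One way such a token could enforce sequential submission is sketched below. This is an assumption for illustration: the text does not describe the token format, so the random per-client token, the `CounterService` class, and its `increment` method are all hypothetical.

```python
import secrets


class CounterService:
    """Toy shared counter whose replies carry a token for the next request."""

    def __init__(self):
        self.value = 0
        self.expected = {}  # client id -> token that client must present next

    def increment(self, client_id, token=None):
        # A client's first request carries no token; later ones must echo
        # the token from that client's most recent reply.
        if self.expected.get(client_id) != token:
            raise ValueError("stale or missing token: request is out of order")
        self.value += 1
        new_token = secrets.token_hex(8)  # hypothetical token format
        self.expected[client_id] = new_token
        return self.value, new_token


service = CounterService()
value, token = service.increment("client-a")
value, token = service.increment("client-a", token)
print(value)  # 2
```

Because a client cannot build request k+1 without the reply (or a speculated reply) to request k, its requests form a dependency chain, which is what makes the benchmark sensitive to per-request latency.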

Our client is a simple loop that issues a fixed num- 
ber of counter updates and records the total time spent. 
No state is externalized by the client, so we allow the 
client process to implement its own lightweight check- 
point mechanism. Checkpoint operations take negligible 
time, so our results focus on the characteristics of the 
protocol itself rather than our checkpoint mechanism. 

We measure throughput by increasing the number of 
client processes per machine (up to 17 processes) until 
the server appears saturated. Graphs show the mean of at 
least 6 runs, and visible differences are statistically sig- 
nificant. 

Figure 3 shows the measured throughput in a LAN 
configuration. We found that in this topology, a sin- 
gle PBFT-CS client gains no benefit from having more 
than 4 concurrent requests, and we enforce that limit 
on all clients. When we have 12 or fewer concurrent
clients, PBFT-CS has 1.19-1.49x higher throughput
than Zyzzyva and 1.79-2x higher throughput than
PBFT.

Figure 5: Read-only NFS micro-benchmark performance across different network topologies. The last three data sets
use 0-reply speculation. At 0 ms, all three topologies are equivalent, so the same data is used for each graph. The no
rep data show a lower bound for run time. There is only one no rep data set for primary-local and primary-remote
topologies, because the location of the server does not change with increasing latency. For these two graphs, the 0 ms
bar applies to all latencies but is not repeated.

In lightly loaded systems, the servers are not being 
fully utilized, and speculating clients can take advantage 
of the spare resources to decrease their own effective la- 
tency. As the server becomes more heavily loaded, those 
resources are no longer free to use. As a result, PBFT-CS
reaches its peak throughput before other protocols. 

There is a trade-off of throughput for latency: PBFT- 
CS shows a peak throughput that is 17.6% lower than 
PBFT. We found four fundamental sources of overhead, 
summarized in Table 1. First, the client implementa-
tion for PBFT-CS uses an event-driven system to han-
dle several logical clients, needed to support concurrent
requests. This design does lead to a slower client than 
the one in PBFT, which can get by with a simpler block- 
ing design. Second, we found that having the primary 
send early replies increases its time spent blocking while 
transmitting. Third, each predicate added to a request 
makes the request packet larger, and fourth, those predi- 
cates take additional work to verify on each replica. 


6.3 Counter latency 


We next examine how latency affects client performance 
under a light workload when the client is located at dif- 
ferent sites. Figure 4 shows the time taken for a single 
counter client to issue 2000 requests in different topolo- 
gies. In the LAN topology where no delay is added, a 
PBFT-CS client is able to complete the benchmark in
33% less time than PBFT, reflecting average run times 
of 357 ms and 538 ms respectively. When we increase 
the latency between sites, run time becomes dominated 
by the number of communication phases. With a uniform
topology (Figure 4b), PBFT-CS takes 50% less time than
PBFT and 33% less time than Zyzzyva, and its runtime
is only 1% slower than the unreplicated service. This 
matches our intuitive understanding of the protocol be- 
havior described at the start of this section. 
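
The intuition can be checked with a back-of-the-envelope model. This is a sketch under a simplifying assumption that the text does not state as a formula: when inter-site delay dominates, the run time of a sequential client is proportional to the number of communication phases it waits for per request (4 for PBFT, 3 for Zyzzyva, 2 for PBFT-CS). The function name `reduction` is illustrative.

```python
# Phases a client waits for before it can issue its next request (from the text).
PHASES = {"PBFT": 4, "Zyzzyva": 3, "PBFT-CS": 2}


def reduction(baseline, improved):
    """Fractional run-time reduction of `improved` relative to `baseline`."""
    return 1.0 - improved / baseline


vs_pbft = reduction(PHASES["PBFT"], PHASES["PBFT-CS"])
vs_zyzzyva = reduction(PHASES["Zyzzyva"], PHASES["PBFT-CS"])
print(f"{vs_pbft:.0%} less time than PBFT")        # 50% less time than PBFT
print(f"{vs_zyzzyva:.0%} less time than Zyzzyva")  # 33% less time than Zyzzyva

# The LAN measurement quoted above (357 ms vs. 538 ms) gives a similar figure,
# although on the LAN processing time rather than network delay dominates.
lan = reduction(538, 357)
print(f"{lan:.1%} less time on the LAN")
```

The phase-count ratios reproduce the 50% and 33% figures reported for the uniform topology.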

For PBFT-CS, the critical path is a round-trip commu- 
nication with the primary replica. Moving to a primary- 
remote topology (bringing one backup replica closer) 
does not affect this critical path, and our measurements 
show no significant difference between primary-remote 
and uniform topologies. 

Figure 4a presents results when using a primary-local 
topology. As latency increases and backup replicas move 
further from the client, performance does not degrade 
significantly, since the latency to the primary is fixed. 
At 15 ms latency, a client using PBFT takes 58x longer 
than with PBFT-CS. The combination of client specu-
lation and a co-located primary achieves much of the
performance benefit of a closely located non-replicated 
server, while providing all the guarantees of a geograph- 
ically distributed replicated service that tolerates Byzan- 
tine faults. 

These significant gains are directly attributable to 
the increased concurrency possible in the primary-local 
topology. When we limit PBFT-CS to only 4 outstanding 
requests, the client must then wait on requests to commit, 
reintroducing a dependence on communication delay. In 
topologies where the client does not have privileged ac- 
cess to the primary, as in the uniform topology, limiting 
concurrency has little effect. 


6.4 NFS 


We next examine PBFT-CS applied to an NFS server.
Considering that the NFSv2 protocol is not explicitly
designed for high-latency environments, we compare
against the variation of NFS that uses 0-reply specula-
tion. All benchmarks begin with a freshly-mounted file
system and an empty cache.


NSDI ’09: 6th USENIX Symposium on Networked Systems Design and Implementation


[Figure 6 plot. Panels: (a) Primary-local, (b) Primary-remote, (c) Uniform. x-axis: Network delay (ms) at 0, 2.5, and 15. Series: PBFT, PBFT + 0-spec, PBFT-CS, No rep + 0-spec.]

Figure 6: Write-only NFS micro-benchmark.

Figure 7: Read/Write NFS micro-benchmark.

Unlike the counter service, this application has over- 
head associated with creating, committing, and rolling 
back to a checkpoint. Processes may have computation 
to perform between requests, and they may need to block 
before an output commit. 

For comparison with non-speculative systems, we 
measure the performance of NFS under PBFT. Using 
our speculative NFS protocol, we measure PBFT using 
only 0-reply speculation (PBFT + 0-spec) and PBFT-CS.
The difference between these two measurements shows
the benefit of 1-reply speculation. As a lower bound, we
also measure the performance of a non-replicated NFS
server that uses 0-reply speculation (No rep + 0-spec).

We use a vanilla kernel for evaluating non-speculative 
PBFT with a slight modification that increases the num- 
ber of concurrent RPC requests allowed. Other bench- 
marks use the Speculator kernel. 

In the no replication configuration, the NFS client uses
a thin UDP relay on the local machine that stands in for
the BFT relay.

Our modifications to the NFS client, the relay, and 
the replicated service have introduced additional over-
head that is not present in the original PBFT. This inef-
ficiency is particularly apparent in our 0 ms topologies,
where PBFT-CS shows a 1.03-2.18x slowdown relative
to PBFT across all our benchmarks. However, in all 
cases at higher latencies, client speculation results in a 
clear improvement, and we primarily address these con- 
figurations in the following sections. 

At the time of publication, we had not yet ported our 
NFS server to use the Zyzzyva protocol, so we regret-
fully are unable to provide a direct comparison for these
benchmarks. 

All graphs show the mean of at least five measure- 
ments. Error bars are shown when the 95% confidence 
interval is above 1% of the mean value. 
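
The error-bar rule can be expressed as a short check. The text does not say how the confidence interval was computed; the sketch below assumes a two-sided Student's t interval, reads "interval" as the half-width, and hard-codes critical values for small sample sizes. The sample run times are made up for illustration.

```python
import math
import statistics

# Two-sided 95% Student's t critical values, indexed by degrees of freedom.
T_95 = {4: 2.776, 5: 2.571, 6: 2.447, 7: 2.365, 8: 2.306, 9: 2.262}


def ci95_half_width(samples):
    """Half-width of the 95% confidence interval for the mean of `samples`."""
    n = len(samples)
    return T_95[n - 1] * statistics.stdev(samples) / math.sqrt(n)


def needs_error_bar(samples, threshold=0.01):
    """True when the interval half-width exceeds `threshold` of the mean."""
    return ci95_half_width(samples) > threshold * statistics.mean(samples)


steady = [10.00, 10.01, 9.99, 10.00, 10.02]  # hypothetical run times (s)
noisy = [10.0, 11.5, 9.2, 10.8, 9.5]         # hypothetical run times (s)
print(needs_error_bar(steady), needs_error_bar(noisy))  # False True
```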


6.5 NFS: Read-only micro-benchmark 


We first ran a read-only micro-benchmark that greps
for a common string within the Linux headers. The total
size of the searched files is about 9.1 MB. Most requests
in this benchmark are read-only and are optimized to cir-
cumvent agreement.

Figure 8: The Apache build NFS benchmark measures how long it takes to compile and link Apache 2.0.48.

Figure 5 shows that PBFT takes 2.06x longer to com-
plete than PBFT-CS at 15 ms. 0-reply speculation lets the
client avoid blocking when revalidating a file after open-
ing it. With PBFT-CS, we can additionally read from
a file without delay: a nearby replica supplies all the 
speculative data. Without a nearby replica (in uniform 
topology), 1-reply speculation is not beneficial since op- 
timized reads complete at about the same time the client 
gets its first reply. 


6.6 NFS: Write-only micro-benchmark 


We next ran a write-only micro-benchmark that writes 
3.9 MB into an NFS file (Figure 6). All writes are issued 
asynchronously by the file system, and the client only 
blocks when the file is closed. In this case, speculation is 
not needed to increase the parallelism of the system. 

There are a very small number of read requests in this 
benchmark, issued when first opening a file, so there 
is no practical opportunity to use 1-reply speculation.
Speculation at 2.5 ms reduces the benchmark run time
by only 6-7%. We found that within each latency (ir-
respective of topology), there is no statistical difference
between PBFT+0-spec and PBFT-CS.


6.7 NFS: Read/write micro-benchmark


We next ran a read/write micro-benchmark that creates 
100 4 KB files in a directory. For each file, the client 
creates and writes to a file; this includes read-only op-
erations to read the directory entries. PBFT-CS never
blocks on any of these operations. 

In the primary-local topology, PBFT takes up to 19x 
longer to complete than PBFT-CS (Figure 7). Further-
more, PBFT-CS shows a resilience to changes in latency
as it increases from 0 to 15 ms: PBFT-CS execution time
doubles while PBFT takes 59x longer. On the primary-
remote and uniform topologies, operations take longer to
complete, but client speculation still speeds up run time
by 6.03x.


6.8 NFS: Apache build macro-benchmark 


Finally, we ran a benchmark that compiles and links 
Apache 2.0.48. This emulates the standard Andrew-style 
benchmark that has been widely used in the PBFT liter- 
ature. This is intended to model a realistic and common 
workload, where speculation allows significant compu- 
tation to be overlapped with I/O. 

Within the primary-local topology, PBFT takes up to 
5.0x longer to complete than PBFT-CS (Figure 8). In
the uniform topology, PBFT takes up to 2.2x longer than
PBFT-CS. Since files are often reused many times during
the build process, there is less opportunity to benefit from
1-reply speculation. However, the relative difference in
performance degradation as latency increases is still sig-
nificant. With a co-located primary, PBFT-CS becomes
4.3x slower as delay increases to 15 ms, while PBFT 
slows down by a factor of 25. 


6.9 Cost of failure / faulty primary 


To measure the cost of speculation failures, we mod-
ified our PBFT-CS relay to inject faulty digests into
early replies, simulating a primary that returns corrupted 
replies at a rate of 1%. Any speculation based on a 
corrupted reply will eventually be rolled back, and any 
dependent requests will be turned into no-ops on good 
replicas. 

The results of this experiment are presented in Fig- 
ure 9. We used the Apache build benchmark in the 
primary-local topology. The injected faults were respon- 
sible for slowdowns in PBFT-CS of 3%, 9%, and 29% at 
0 ms, 2.5 ms, and 15 ms delay respectively. 

These slowdowns are not identical because a client 
may have a greater number of requests in the pipeline 
for completion at a 15 ms delay than at a 0 ms delay.
When one request fails, nearly all outstanding requests
also fail. We observed that 1% of our speculations failed
directly, and an additional 1%, 4%, and 5% of specula-
tions (at 0 ms, 2.5 ms, and 15 ms respectively) failed due
to their dependencies. These extra requests added unnec-
essary load to the replicas. By executing more requests in
advance, clients must roll back a larger amount of state.

Figure 9: For the Apache build benchmark in the
primary-local topology, PBFT-CS is at worst 29% slower
when 1% of its speculations fail.
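
The cascade can be illustrated with a toy model. It assumes, as a simplification, that every outstanding speculative request depends on all requests issued before it; the function `rollback` and the list-based pipeline are hypothetical and are not part of the PBFT-CS relay.

```python
def rollback(pipeline, failed):
    """Split outstanding requests when the early reply to `failed` was wrong.

    `failed` itself still commits with its correct result. Every later
    request was issued from speculative state derived from the bad reply,
    so the client rolls it back and good replicas turn it into a no-op.
    """
    index = pipeline.index(failed)
    return pipeline[:index + 1], pipeline[index + 1:]


# A longer delay lets the client run further ahead, so more work is lost.
short_pipeline = ["r1", "r2"]
long_pipeline = ["r1", "r2", "r3", "r4", "r5", "r6"]

_, lost_short = rollback(short_pipeline, "r1")
_, lost_long = rollback(long_pipeline, "r1")
print(len(lost_short), len(lost_long))  # 1 5
```

Under this model the number of dependent failures per corrupted reply grows with pipeline depth, which is consistent with the larger slowdown observed at 15 ms.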

As discussed in Section 3.3, once a client detects that
1% of requests are failing, it can stop trusting the primary 
to provide good first replies and disable its own specula- 
tion. If replies are signed, each primary can cause only a 
single failed speculation, and the resulting view change 
will dominate recovery time. For reference, over 100 
failed speculations in this benchmark result from a 1% 
failure rate. 


7 Related work 


This paper contributes the first detailed design for apply- 
ing client speculative execution to replicated state ma- 
chine protocols. It also provides the first design and im- 
plementation that uses client speculation to hide latency 
in PBFT [8]. 

Speculator [28] was originally used to hide latency in 
distributed file systems, and thus our work shares many 
of Speculator’s original goals. Speculator’s distributed 
file system application assumes the existence of a cen- 
tral file server that always knows ground truth. No such 
entity exists in a replicated state machine. For instance, 
non-faulty replicas may disagree about the ordering of 
read-only requests as discussed in Section 3.2.2. Prior to 
this paper, Speculator was only used to speculate on zero 
replies. The possibility of also speculating on a single 
reply opens up several potential protocol optimizations 
that we have explored, including the possibility of gen- 
erating early replies and optimizing agreement protocols 
for throughput. 

Speculative execution is a general computer science
concept that has been successfully applied in hardware
architecture [15, 17, 35], distributed simulations [19], 
file I/O [10, 16], configuration management [36], dead- 
lock detection [26], parallelizing security checks [29], 
transaction processing [20] and surviving software fail- 
ures [12, 31]. This work contributes by applying specu- 
lation to another domain, replicated state machines. 

There has also been extensive prior work in the de- 
velopment of replicated state machines, both in the fail- 
stop [24, 30, 34] and Byzantine [1, 8, 11, 21, 22, 32, 37] 
failure models. While Byzantine fault tolerance in par- 
ticular has been an area of active research, it has seen 
relatively limited deployment due to its perceived com- 
plexity and performance limitations. 

Our client-side speculation techniques apply equally 
well to reducing latency in both fail-stop and Byzantine 
fault tolerance protocols. However, they are particularly 
useful for protocols that tolerate Byzantine faults due to 
the higher latencies of such protocols. 

PBFT [8] provides a canonical example of a Byzan- 
tine fault-tolerant replicated state machine, using multi- 
ple phases of replica-to-replica agreement to order each 
operation. Several systems since PBFT have aimed to re- 
duce the latency in ordering client operations, typically 
by optimizing for the no-failure case [22] or for work- 
loads with few concurrent writes [1, 11]. 

Byzantine quorum state machine replication protocols 
such as Q/U [1] build upon earlier work in Byzantine 
quorum agreement [3, 4, 13, 27], and provide lower la- 
tency in the optimal case. Q/U is able to respond to write 
requests in a single phase, provided that there are no 
write operations by other clients that modify the service 
state; inconsistent state caused by other clients requires a 
costly repair protocol. HQ [11] aims to reduce the cost
of repair, and reduces the number of replicas required in
a Byzantine quorum system from 5f+1 to 3f+1, but it
introduces an additional phase to the optimized protocol.

Agreement protocols that use a primary replica are 
able to batch multiple requests into a single agreement 
operation, greatly reducing the overhead of the proto-
col and increasing throughput. While our approach ap-
plies to both quorum and agreement protocols, the higher
throughput offered by batched agreement, along with re-
silience during concurrent write workloads, makes agree-
ment protocols a better match for our techniques.

Our work on client speculation complements the 
server-side use of speculation in Zyzzyva [22]. In 
Zyzzyva, replicas execute operations speculatively based 
on an ordering provided by the primary, while in our sys- 
tem clients speculate based on an early response from the 
primary (or on 0 replies), with replicas executing only 
committed operations. These two approaches are com- 
plementary. Client speculation allows a client to issue a 
subsequent operation after only a single phase of com-
munication with the primary, which is especially helpful
for geographically dispersed deployments where some 
replicas are far from the client. Server speculation speeds 
up how fast replicas can supply a consensus response 
to the client, which would allow clients in our system 
to commit speculations faster. While we have evalu- 
ated client speculation on the PBFT protocol, it would 
apply equally well to Zyzzyva, where the client can re- 
ceive early speculative and consensus responses, in the 
absence of failures. 


8 Conclusions and future work 


Replicated state machines are an important and widely- 
studied methodology for tolerating a wide range of 
faults. Unfortunately, while replicas should be dis- 
tributed geographically for maximum fault tolerance, 
current replicated state machine protocols tend to mag- 
nify the effects of the long network latencies associated 
with geographic distribution. In this paper, we have 
shown how to use speculative execution at clients of a 
replicated service to reduce the impact of network and 
protocol latency. We outlined a general approach to us- 
ing client speculation with replicated services, then im- 
plemented a detailed case study that applies our approach 
to a standard fault tolerant protocol (PBFT). 

In the future, we hope to apply client speculation to 
a wider range of protocols and services. For example, 
adding client speculation to a protocol that uses server 
speculation [22] should allow clients to commit specula- 
tions faster. It may also be possible to apply client spec- 
ulation to protocols that use more complex replication 
schemes, such as erasure encoding [18], although clients 
of such protocols may require more than one reply to 
predict the final response with high probability.


Abstract


Increasingly people manage and share information
across a wide variety of computing devices from cell
phones to Internet services. Selective replication of
content is essential because devices, especially portable
ones, have limited resources for storage and communica-
tion. Cimbiosys is a novel replication platform that per-
mits each device to define its own content-based filtering
criteria and to share updates directly with other devices.

In the face of fluid network connectivity, redefinable 
content filters, and changing content, Cimbiosys ensures
two properties not achieved by previous systems. First,
every device eventually stores exactly those items whose
latest version matches its filter. Second, every device
represents its replication-specific metadata in a compact
form, with state proportional to the number of devices
rather than the number of items. Such compact repre-
sentation results in low data synchronization overhead,
which permits ad hoc replication between newly encoun-
tered devices and frequent replication between estab-
lished partners, even over low bandwidth wireless net-
works.


1 Introduction


Delivering information that is relevant to different
people, or is appropriate for different devices, requires
system support for a richer notion of data synchroniza-
tion, one that incorporates personalized content filtering.
In many social and work settings, where bandwidth, stor-
age, and human attention may be at a premium, filtering
enables information to spread according to interests and
requirements. Personal information needs do not always
adhere to the rigid organizational structures imposed by
data providers [3], but rather can often be characterized
by flexible query-like predicates over the contents of di-
verse data collections.

At the same time, timely and robust information shar-
ing cannot always rely on established Internet connectiv-
ity or depend on centrally managed storage. Communi-
cation between devices may be ad hoc, taking advantage
of the proximity of neighboring devices and the avail-
ability of particular content. For example, in the wake of
Hurricane Katrina, disaster workers needed to quickly set
up ad hoc networks in which communication and control
were distributed and egalitarian [5].

In this paper, we present Cimbiosys, a replicated stor-
age platform designed to support collaboration within
loosely-organized communities with applications such as 
home media management and shared calendars and to fa- 
cilitate the interplay between mobile devices and cloud- 
based services. The main contribution of this work is 
demonstrating how to permit content-based partial repli- 
cation among peers while providing two important sys- 
tem properties: 


• Eventual filter consistency: Each device eventually
stores precisely those items that would be returned
by running its custom filter query against the full
data collection.


• Eventual knowledge singularity: The state that is
transmitted between devices in synchronization re-
quests and is used to identify unknown latest ver-
sions converges to a size that is proportional to the
number of replicas in the system rather than the
number of stored items.
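
The first property can be stated as a simple predicate. The sketch below is illustrative only: items are modeled as a dictionary mapping an item id to a set of tags, and a filter is any Python predicate over an item's contents. None of these names come from Cimbiosys.

```python
def is_filter_consistent(store, full_collection, matches):
    """True if `store` holds exactly the latest items that match the filter."""
    expected = {item_id: content
                for item_id, content in full_collection.items()
                if matches(content)}
    return store == expected


def public_only(tags):
    return "public" in tags


# Latest version of every item in the collection (item id -> tags).
collection = {
    "p1": {"public", "thailand"},
    "p2": {"family"},
    "p3": {"public"},
}

flickr_ok = {"p1": {"public", "thailand"}, "p3": {"public"}}
# Missing p3, and still holding an old version of p2 that was once public.
flickr_stale = {"p1": {"public", "thailand"}, "p2": {"family", "public"}}

print(is_filter_consistent(flickr_ok, collection, public_only))     # True
print(is_filter_consistent(flickr_stale, collection, public_only))  # False
```

The property is "eventual" because a device is only required to reach this state once updates stop and synchronization has had time to propagate them.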


Eventual consistency has long been demanded by ap- 
plications and provided in replicated systems. Ensuring 
eventual filter consistency in a system that permits peer- 
to-peer synchronization between devices with individual, 
content-based filters is more challenging. Not only may 
a device’s interest in specific items fluctuate over time 
as the items are updated, but a device may vary its fil- 
tering criteria, causing items with stable contents to en- 
ter and leave the device’s interest set. The next section 
expands on the substantial challenges of content-based 
partial replication. 

Eventual knowledge singularity is a new property 
we have defined to convey the importance of compact 
synchronization-specific state in making economical use 
of bandwidth and system resources. Essentially, even- 
tual filter consistency is an important correctness prop- 
erty while knowledge singularity is hidden from appli- 
cations but provides performance and convergence ben- 
efits. In particular, this property allows Cimbiosys to 
use brief intervals of connectivity between peer devices 
and permits more frequent exchanges between regular 
synchronization partners, thereby reducing convergence 
delays. By contrast, conventional synchronization tech-
niques that exchange per-item version vectors or rely on
operation logs make less effective use of relatively slow
or intermittent connections. In such systems, the data ex-
changed during synchronization is roughly proportional
to the collection size or dependent on the update rate; this
limitation becomes important as collection sizes grow
into the tens of thousands of items and items are updated
repeatedly.

Figure 1: Photo sharing


2 Challenges 


To further illustrate the needs of applications that manage 
partially replicated data, consider the photo sharing sce- 
nario depicted in Figure 1. Alice is traveling in Thailand, 
photoblogging as she goes. Each night, the day’s photos 
are copied from Alice’s camera to her laptop. When she 
reaches a town with an Internet café, she uploads select 
photos to her Flickr account. After Alice returns from her 
trip, her photos are synchronized with the master collec- 
tion on her PC. She spends several weeks working with 
her new photos on the PC, rating them using one to five 
stars, adding additional tags, and cropping or retouching 
photos. Five-star photos are uploaded via a direct WiFi 
connection to her living room’s photo frame. Photos that 
Alice tags “public” are uploaded to a travel photoset on 
Flickr and onto a photojournalism web site. A copy of all 
of her family photos is retained on her laptop, so she’ll
have them with her when she travels again. 

This scenario reveals an implicit set of requirements 
for a modern storage platform: 


• Updates may originate from multiple sites and pro-
duce new versions of items that must be selectively
disseminated to various devices.


• Interdevice communication may be ad hoc, taking
advantage of device proximity and the availability
of particular content.


• Not all devices (or even cloud-resident services)
store complete collections, and the items of inter-
est vary across devices according to their uses and
capabilities.


At first blush, adding content-based filtering to a repli- 
cation protocol may seem straightforward. Start with any 
protocol that fully replicates data and guarantees even- 
tual consistency. Whenever a data item is about to be sent 
via this protocol, check the contents of the item against 
the destination device’s filter. If the item matches, and 
hence is of interest to the destination, then continue to 
send the item; if it does not match, then ignore the item. 
Unfortunately, this simple scheme does not ensure even- 
tual filter consistency. 

Content-based filtering for devices with arbitrary com- 
munication topologies introduces five key challenges: 


• effective connectivity: ensuring, in the face of vary-
ing device-specific filters, that every item has a path
by which it can flow to all interested parties;


• partial synchronization: permitting incremental
synchronization between peers with overlapping
interests without wasting bandwidth on duplicate
items or excessive exchanges of metadata;


• item move-outs: informing devices of items they
store that no longer match their filters due to more
recent updates;


• out-of-filter updates: determining how to propagate
and when to safely discard updated items that do not
match the updating device’s own filter; and


• filter changes: allowing a device to modify its fil-
ter without completely discarding previously stored
items or failing to receive items that match its new
filter.


Unless these issues are explicitly addressed by the 
replication protocol design, they can prevent eventual fil- 
ter consistency. We now describe each of these problems 
in more detail; solutions are presented in later sections. 

A synchronization topology can be viewed as a graph 
where devices (or services) are the nodes, and edges in- 
dicate synchronization partnerships between pairs. Cus- 
tom synchronization topologies that permit indirect com- 
munication between devices are desirable; Alice’s photo 
frame never directly synchronizes with her camera, for 
instance. In a fully replicated system, eventual consis- 
tency can be achieved as long as the topology graph is 
connected and devices at least occasionally synchronize 
with their neighbors. As long as these basic conditions 
are met, topology-independent protocols accommodate ar-
bitrary communication patterns. In a system with par-
tially replicated collections, additional issues arise. For
example, in the scenario in Figure 1, if Alice’s home
PC never directly synchronized with her laptop, then the 
only path for routing new, tagged photos from Alice’s 
laptop to her PC would be through services in the cloud, 
such as Flickr. In this case, the PC would only receive 
laptop-resident photos that are tagged as “public” and, 
hence, have been uploaded to a photo-sharing service. 
Section 7 discusses the topology constraints enforced by 
Cimbiosys to ensure effective connectivity. 
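
The connectivity problem can be made concrete with a small reachability check. This sketch assumes a simplified model in which a device relays an item only if it stores it, that is, only if the item matches its own filter. The topology, filters, and function names are hypothetical and do not reflect the constraints that Cimbiosys actually enforces (Section 7).

```python
from collections import deque


def reachable(item_tags, source, edges, filters):
    """Devices an item can reach when only matching devices relay it."""
    seen = {source}
    queue = deque([source])
    while queue:
        device = queue.popleft()
        for neighbor in edges.get(device, ()):
            if neighbor not in seen and filters[neighbor](item_tags):
                seen.add(neighbor)
                queue.append(neighbor)
    return seen


def stores_all(tags):
    return True


def public_only(tags):
    return "public" in tags


# Laptop and PC never synchronize directly; both synchronize with Flickr.
edges = {"laptop": ["flickr"], "flickr": ["laptop", "pc"], "pc": ["flickr"]}
filters = {"laptop": stores_all, "flickr": public_only, "pc": stores_all}

print(sorted(reachable({"public"}, "laptop", edges, filters)))
# ['flickr', 'laptop', 'pc']
print(sorted(reachable({"family"}, "laptop", edges, filters)))
# ['laptop']
```

In the second call the PC's filter accepts the photo, yet the photo never arrives because the only path runs through a device that does not store it.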


The problem of partial synchronization arises when a 
device synchronizes from a partner that can only supply 
some of the items that match the device’s filter. For ex- 
ample, while Alice is traveling and uploading select pho- 
tos to Flickr, her home PC may synchronize daily with 
the service and obtain these new photos. When Alice re- 
turns home and her laptop synchronizes directly with her 
PC, the PC should not assume that it has already received 
from Flickr any photos taken more than a day ago. In 
general, a device may receive some items that match its 
filter from one synchronization partner and other items 
of interest from other partners. Section 4.2 introduces 
the notion of item-set knowledge to deal with this issue. 


An item is said to move out of a device’s interest set 
when an update to the item causes it to no longer match 
the device’s filter. For example, suppose that Alice de- 
cides that one of her public photos is a bit too revealing, 
and so she edits the photo on her PC to remove the “pub-
lic” tag. Using the simple replication approach outlined
earlier, this updated photo would not be sent to Flickr 
when it next synchronizes with Alice’s PC. However, the 
previous version of this photo, the one marked as pub- 
lic, would remain indefinitely on Flickr’s web site, con- 
trary to Alice’s intentions (and violating eventual filter 
consistency). Replication protocols that support content- 
based filtering not only must selectively propagate up- 
dated items but also must inform devices of items that 
should be discarded. Section 5.1 indicates the conditions 
under which Cimbiosys delivers move-out notifications 
during synchronization. 

The fourth challenge is dealing with out-of-filter up- 
dates. A device might update an item producing a ver- 
sion that does not match the device’s own filter. For ex- 
ample, suppose that Alice is working on her laptop and 
edits one of her private photos to remove the “family” 
tag (perhaps a photo of her sister’s ex-husband). In this 
case, Alice’s laptop cannot discard the photo immedi- 
ately, even though it does not match the laptop’s filter, 
since doing so would prevent other devices from learning 
of this edit; the photo can only be discarded by the laptop 
after it synchronizes with the home PC and sends it the 
new version. In some situations, none of a device’s reg- 
ular synchronization partners may be interested in out- 
of-filter updates that it makes. Section 5.2 addresses this 
issue. 


A final challenge arises from the need to support 
changing filters. A person’s information needs may vary 
over time, causing her to change some devices’ filters. 
For example, Alice might decide one day that she wants 
only 5-star photos uploaded to the photojournalism web 
site rather than all of her public photos. One option is for 
a device, upon a change to its filter, to discard all of its lo- 
cally stored items, reset its synchronization state, and es- 
sentially restart as a new replica. However, this approach 
wastes critical resources, such as network bandwidth and 
energy, and may disrupt the person’s work. Section 5.3 
details our approach to filter changes. 


3 Cimbiosys Platform 


Cimbiosys is a platform developed to support a variety 
of applications that manage data on mobile devices, per- 
sonal computers, and cloud-based services. It was de- 
veloped as part of a research project exploring issues in 
community information management (CIM). 


3.1 System model 


In the Cimbiosys distributed architecture, each partici- 
pating node, hereafter simply called a device, stores full 
or partial copies of one or more data collections. A 
collection, for instance, might be an individual’s digital 
photo album, a family’s calendar, a shared video library, 
or a company’s customer database. Each collection is 
managed separately and consists of a set of items that are 
not shared with other collections. 

An item is an XML object plus an optional associ- 
ated file. For example, a photo item stores its JPEG 
data in a conventional file and the associated XML object 
holds descriptive information, such as when the photo 
was taken, its resolution, a quality rating, and human- 
supplied keywords. 

A replica contains copies of some or all of the items 
in a given collection. A device can hold any number of 
replicas of different collections. For simplicity, all of the 
examples used in this paper involve a single collection 
and a single replica per device. 

Each device sharing a collection maintains its own 
replica of the items of interest. The set of items included 
in a device’s replica is specified by a filter, which is a se- 
lection predicate over the items’ XML contents. For ex- 
ample, a filter might select e-mail messages from a par- 
ticular individual, files tagged with certain keywords, or 
photos with a 5-star rating. The default “*” filter indi- 
cates that the device is interested in all items, and hence 
stores a full replica of the collection. Users can set dif- 
ferent filters for each device and can change these filters 
over time. 
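
A filter can be pictured as a boolean function over an item's XML object. The sketch below is illustrative only: the `<photo>`, `<rating>`, and `<keyword>` element names and the `make_filter` helper are assumptions, since the paper does not give the XML schema or the filter syntax.

```python
import xml.etree.ElementTree as ET


def make_filter(required_keyword=None, min_rating=None):
    """Build a selection predicate over an item's XML object (hypothetical schema)."""
    def matches(xml_text):
        root = ET.fromstring(xml_text)
        if min_rating is not None:
            rating = int(root.findtext("rating", default="0"))
            if rating < min_rating:
                return False
        if required_keyword is not None:
            keywords = {k.text for k in root.findall("keyword")}
            if required_keyword not in keywords:
                return False
        return True
    return matches


photo = "<photo><rating>5</rating><keyword>family</keyword></photo>"
five_star = make_filter(min_rating=5)
public_only = make_filter(required_keyword="public")
match_all = make_filter()  # plays the role of the default "*" filter

print(five_star(photo), public_only(photo), match_all(photo))
```

With no constraints, the predicate accepts every item, which corresponds to a device that stores a full replica.
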

Each device is allowed to read its locally stored items 
and update those items at any time, as long as such up- 
dates are in accordance with the collection’s access con- 
trol policy. Update operations are applied directly to 
items in the device’s local replica; such operations are
not logged or explicitly recorded. Updates produce new 
versions of items that are later sent to other replicas via a 
device-to-device synchronization protocol. Devices gen- 
erally have regular synchronization partners but may also 
synchronize with any replica that they encounter. 


A device can join the system simply by creating a new 
(empty) replica of some collection and then synchroniz- 
ing with some existing replica(s). Collections and their 
replicas can be discovered by a variety of means, includ- 
ing social networking web sites, e-mail invitations, nam- 
ing directories, and wireless discovery protocols. 


A replica may remain disconnected from the rest of 
the system for an arbitrary amount of time due to device 
failures or lack of network connectivity. However, we 
assume that each device eventually recovers with its per- 
sistent storage intact, occasionally communicates with 
other devices, and correctly executes the synchronization 
protocol. A device can permanently retire and discard its 
local replica but must first synchronize with some other 
device to ensure that updates are not lost. 


At any point in time, a replica may hold older versions 
of items that have been updated elsewhere, and it may not 
have learned yet of recently created or deleted items. The 
Cimbiosys synchronization protocol guarantees eventual 
filter consistency. That is, a replica eventually receives 
all versions of items that match its filter and have not 
been overwritten by later versions, and the replica even- 
tually discards items that are updated in such a way that 
their contents no longer match the replica’s filter. 


Cimbiosys does not provide other guarantees such as 
causal consistency or multi-item coherence. In particu- 
lar, versions may be received by a device in a different 
order than they were produced. Moreover, a set of ver- 
sions for items that were updated atomically at one de- 
vice may be partially received by another device whose 
filter only matches a subset of the items. 


Naturally, because Cimbiosys allows updates to be 
made at any replica without locking, two (or more) de- 
vices may perform concurrent updates to the same item. 
Such updates result in conflicting versions that are prop- 
agated throughout the system using the synchronization 
protocol. Any device whose filter selects both conflicting 
versions may detect the conflict and either resolve it auto- 
matically or store both versions pending manual resolu- 
tion. Resolving a conflict produces a new version of the 
item that supersedes all known conflicting versions. Any 
existing technique for detecting conflicts, such as per-
item version vectors [16] or concise predecessor vec-
tors [12], could be adopted for use with content-based
partial replication. Thus, no further discussion of con- 
flict management appears in this paper. 

Figure 2: Cimbiosys software architecture


3.2 Software components 


Each device in Cimbiosys runs the set of software mod-
ules depicted in Figure 2. The Item Store manages the
items for local replicas of one or more collections. The 
file portion of each item is stored in a special directory in 
the device’s local file system. XML objects are stored in 
an SQL Server (Compact Edition) database where they 
can be queried and updated transactionally. 

The Communication module is responsible for trans- 
mitting data to other devices using available networks, 
such as Ethernet, WiFi, cellular, or Bluetooth. It
also encapsulates the transport protocol used by the Sync 
module. Devices are free to use a variety of transport 
protocols, including SOAP-based RPC, HTTP, and Mi- 
crosoft’s FeedSync, a set of simple extensions to RSS. 
Of course, any two devices must agree on the network 
and transport protocol that they use during synchroniza- 
tion. 

The Sync module implements the synchronization pro- 
tocol described in Section 4. During synchronization, it 
enumerates versions of items in the local Item Store that 
are unknown to the remote sync partner and sends these 
along with the appropriate metadata. The remote partner 
then adds the received items to its Item Store, possibly re- 
placing older versions of these items. We are considering 
allowing devices to keep multiple versions if requested 
by an application, but our current implementation retains 
only the latest known version of each item. 

Cimbiosys also includes a number of Utilities for 
recording information about regular synchronization 
partners, naming collections and devices, managing ac- 
cess controls, and performing other configuration func- 
tions. 

Security considerations permeate the Cimbiosys de- 
sign. For example, all versions of items are digitally 
signed by the originating device, and collection-specific 
policies dictate which devices are allowed to create, up- 
date, and delete items in a collection. Versions produced 
by a device without write access to the collection (or to
the specific items) are rejected during synchronization. A 
full discussion of the access control design can be found 
in a companion paper [22]. Additionally, techniques 
have been developed for recovering from corrupt ver- 
sions that are introduced through malice or misuse [11]. 


Applications interact with the Cimbiosys platform us- 
ing a specially developed application programming in- 
terface (API). Through this API, an application can cre- 
ate a new collection, create a local replica for an ex- 
isting collection, add items to a collection, update and 
delete items, run queries over items, initiate synchroniza- 
tion between a local and a remote replica, establish regu- 
lar synchronization partnerships, change access permis- 
sions, and change a replica’s filter. Legacy applications 
that read and write local files, and do not use the Cim- 
biosys API, are supported by “watcher” processes that 
monitor file system directories and import files into (or 
delete items from) a local replica. 


3.3 Implementation and validation


Cimbiosys has been implemented in two different en- 
vironments. One implementation is in C# using Mi- 
crosoft’s .NET Framework running on Windows. We 
plan to port this code to Windows Mobile 6.0 so it can 
run on handheld mobile devices. The other implementa- 
tion is in Mace, a C++ language extension that supports 
distributed systems development [8]. Both implementa- 
tions are used in the evaluation presented in Section 8. 


Additionally, the synchronization protocol has been 
fully specified in TLA+ [10]. Extensive model check- 
ing has been performed on both the TLA+ specification 
and the Mace implementation to ensure that the protocol 
meets the stated design goals, that is, achieves eventual 
filter consistency and eventual knowledge singularity un- 
der a variety of operating conditions. 


Two applications have been designed and are intended 
for deployment in our lab. Cimetric, implemented in C#, 
is a collaborative authoring tool. It coordinates access 
and updates to the complex, heterogeneous set of text, 
graphics, and data files created and modified in the pro- 
cess of writing a paper. Authors receive their own repli- 
cas of the paper, perform local updates, and make those 
updates visible to coauthors when they are ready to share 
a new version. CimBib is designed as a bibliographic 
database and personal digital library in which colleagues 
can share references to local and remote copies of pub- 
lished papers as well as personal annotations and recom- 
mendations; this application is still in a user-centered de- 
sign phase. The designs of both Cimetric and CimBib 
were informed by a qualitative field study of scholarly 
writing and reference use [13]. 

Figure 3: Sample metadata held on the photo frame


4 CIM Sync Basics 


The next three sections focus on a key aspect of the Cim- 
biosys platform, the synchronization protocol. The ba- 
sic protocol is introduced in this section; Sections 5 and 
6 address how the protocol meets the challenges of fil- 
ter consistency (storing the items that currently match a 
replica’s filter and no other items) and knowledge singu- 
larity (operating efficiently by optimizing the metadata 
that is exchanged during synchronization). 


4.1 Metadata 


The CIM Sync protocol relies on both per-item and per- 
replica metadata. Each collection and each item in a col- 
lection has a unique identifier, as does each replica of 
a collection. Each version of an item also has a unique 
identifier called its version-id. Whenever an item is cre- 
ated, updated, or deleted, the replica on which this op- 
eration is performed creates a new version-id for the 
item consisting of the replica’s identifier coupled with 
a counter of the number of update operations that have 
been performed by that replica. Deleted items are simply 
marked as deleted; such items are treated as out-of-filter 
versions as discussed in Section 5.2 and are eventually 
discarded by all replicas. 
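
The version-id scheme can be sketched in a few lines. This is a simplified illustration, not the Cimbiosys item store: the `Replica` class, its method names, and the dictionary layout are assumptions, and conflict-detection metadata is omitted.

```python
class Replica:
    """Toy replica that stamps every create, update, or delete with a version-id."""

    def __init__(self, replica_id):
        self.replica_id = replica_id
        self.update_count = 0   # number of update operations performed here
        self.items = {}         # item_id -> {"version_id", "contents", "deleted"}

    def _next_version_id(self):
        self.update_count += 1
        return (self.replica_id, self.update_count)

    def put(self, item_id, contents):
        """Create or update an item; only the latest version is retained."""
        version_id = self._next_version_id()
        self.items[item_id] = {
            "version_id": version_id,
            "contents": contents,
            "deleted": False,
        }
        return version_id

    def delete(self, item_id):
        """Mark the item as deleted; the marker is itself a new version."""
        entry = self.items[item_id]
        entry["version_id"] = self._next_version_id()
        entry["deleted"] = True
        return entry["version_id"]


a = Replica("A")
first = a.put("p", {"rating": 5})
second = a.put("q", {"rating": 4})
third = a.delete("p")
print(first, second, third)
```

Because the counter is per replica rather than per item, consecutive version-ids from one replica may belong to different items, which is what later allows a version vector to summarize them compactly.
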

For each item in a replica, the Cimbiosys item 
store maintains the item’s unique identifier, version-id, 
XML+file contents, deleted bit, and additional informa-
tion used to detect whether different versions of the item
are in conflict (similar to the made-with knowledge used 
in WinFS [15]). Only the latest known version of each 
item is retained in the item store. Older versions are con- 
sidered obsolete. 

Figure 3 depicts the data and metadata maintained by 
a sample replica in our photo sharing scenario. This 
particular replica, the digital photo frame, is known as 
replica B. Note that uppercase letters are used through-
out this paper as unique replica identifiers while low-
ercase letters are used as unique item identifiers. This
replica has not performed any local updates, and hence 
its updateCount is zero. Its filter indicates that it is in- 
terested only in photos with a 5-star rating. The replica’s 
item store is shown as a table at the bottom of the figure. 
It stores four photos: items p, q, r, and k. Every item has
a unique version-id. Item p, for instance, has a version- 
id of A:1, meaning that this version was produced by 
replica A’s first update operation, and has a rating of 5 
stars. Each item has additional data and metadata that is 
not shown in the figure, such as the actual photo contents 
and the deleted bit. Finally, this replica has knowledge 
about the items that it stores as described next. 


4.2 Item-set knowledge 


Each replica maintains knowledge recording the set of 
versions that are known to the replica. Conceptually, 
a replica’s knowledge is simply a set of version-ids; it 
contains identifiers for any versions that (a) match the 
replica’s filter and are stored in its item store, (b) are 
known to be obsolete, or (c) are known to not match 
the replica’s filter. Including the third class of versions, 
out-of-filter versions, and using a novel representation 
called item-set knowledge distinguishes the knowledge 
used in CIM Sync from that of other replication proto- 
cols like Bayou [18] that do not support content-based 
partial replication. 

Knowledge is represented as one or more fragments 
where each fragment is a version vector [16] and an as- 
sociated explicit set of item ids. The version vector com- 
ponent indicates, for each replica that has updated any 
item in the collection, the latest known version-id gen- 
erated by the replica. Semantically, if a replica holds a 
knowledge fragment S:V then the replica knows all ver- 
sions of items in the set S whose version-ids are included 
in the version vector V. When a replica’s knowledge 
contains multiple fragments, the replica’s overall knowl-
edge is the union of the version-ids from each fragment.
Note that, from its knowledge alone, a replica cannot de- 
termine whether a known version is stored, obsolete, or 
out-of-filter. 

For example, replica B in Figure 3 has a single knowl-
edge fragment whose item-set is {k,p,q,r}, the ids of
the four items that are stored by this replica, and a ver- 
sion vector of <A:4, C:1>. Replica B, the photo frame, 
does not appear in the version vector since it never di- 
rectly updates items and hence does not generate any 
versions. Replica B’s knowledge indicates that the de-
vice is aware of any versions of items k, p, q, or r with
a version-id of A:1, A:2, A:3, A:4, or C:1. It does not
mean, however, that each of these version-ids is for a cur-
rent or obsolete version of one of these items. To permit
a compact knowledge representation, the version vector
may include version-ids for items that are not in the as-
sociated set; technically, those versions are not known to
the replica. For instance, version A:2 could be the latest 
version of some item w that is not stored by replica B and 
that may or may not match its filter. 

A knowledge fragment may specify “*” as the item-
set, meaning that the set includes all items in the col-
lection. Such fragments are called star-knowledge. In a
system consisting entirely of full replicas, each replica’s
knowledge is always a single star-knowledge fragment.
Partial replicas introduce the need for item-set knowl-
edge in addition to star-knowledge. In a system with
a mix of full and partial replicas, any replica may have 
both star-knowledge and any number of item-set knowl- 
edge fragments, at least temporarily. For instance, after 
synchronizing from a partial replica, a full replica may 
end up with item-set knowledge reflecting the set of re- 
ceived items. 
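
The semantics of knowledge fragments can be made concrete with a small membership test. The sketch below is one possible encoding, not the Cimbiosys data structure: the `Fragment` class and the `knows` helper are hypothetical names, and version vectors are modeled as dictionaries from replica id to the highest known counter.

```python
STAR = "*"  # item-set of a star-knowledge fragment


class Fragment:
    """One knowledge fragment S:V, with S an item-set (or "*") and V a version vector."""

    def __init__(self, item_set, vector):
        self.item_set = STAR if item_set == STAR else frozenset(item_set)
        self.vector = dict(vector)  # replica_id -> highest known counter

    def knows(self, item_id, version_id):
        replica_id, counter = version_id
        covers_item = self.item_set == STAR or item_id in self.item_set
        return covers_item and counter <= self.vector.get(replica_id, 0)


def knows(knowledge, item_id, version_id):
    """Overall knowledge is the union of the versions known to each fragment."""
    return any(f.knows(item_id, version_id) for f in knowledge)


# Replica B from the running example: {k,p,q,r}:<A:4, C:1>.
frame_knowledge = [Fragment({"k", "p", "q", "r"}, {"A": 4, "C": 1})]
# A full replica holding a single star-knowledge fragment (counter is made up).
full_knowledge = [Fragment(STAR, {"A": 9})]

print(knows(frame_knowledge, "p", ("A", 1)))  # item in set, counter covered
print(knows(frame_knowledge, "w", ("A", 2)))  # item w is outside the item-set
```

The second query illustrates the point made in the text: version A:2 falls within the version vector, but it is not known to replica B if it belongs to an item outside the fragment's item-set.
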


4.3 Filtered synchronization 


Cimbiosys uses a one-way, pull-style synchronization 
protocol. A replica, called the target replica, initiates 
synchronization with another replica, called the source 
replica. Each device generally plays the role of the target 
replica for some synchronization sessions and the source 
replica for others. Two-way synchronization requires a 
pair of devices to synchronize, switch roles, and then 
synchronize again. 

The target replica starts by sending a SyncRequest 
message that includes the target’s knowledge and its fil- 
ter. The target is not sent any versions that are already 
included in its knowledge or that are not of interest. In 
particular, the source replica checks its item store for 
any items whose version-ids are not known to the target 
replica and whose XML contents match the target’s filter. 
The XML contents, file contents, and metadata for each 
of these items are returned to the target. If possible, as 
discussed in Section 5.1, the source replica also informs 
the target replica of items that no longer match its filter. 
Finally, the source replica responds with a SyncCom- 
plete message including one or more knowledge frag- 
ments that are added to the target’s knowledge. At the 
very least, this learned knowledge includes knowledge 
pertaining to items transmitted during this synchroniza- 
tion session but may include additional version-ids as 
discussed in Section 6.1. 
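
The source side of a synchronization session reduces to a scan of the item store against the target's knowledge and filter. The following sketch makes several simplifying assumptions: XML contents are modeled as dictionaries, all function names are hypothetical, the version numbers are invented for illustration, and the learned knowledge is limited to one conservative fragment per received item rather than the richer knowledge a SyncComplete message may carry.

```python
def knows(knowledge, item_id, version_id):
    """knowledge is a list of (item_set or "*", version_vector) fragments."""
    replica_id, counter = version_id
    return any(
        (item_set == "*" or item_id in item_set)
        and counter <= vector.get(replica_id, 0)
        for item_set, vector in knowledge
    )


def handle_sync_request(source_store, target_knowledge, target_filter):
    """Select items whose versions the target does not know and that match its filter."""
    to_send = {}
    for item_id, item in source_store.items():
        if knows(target_knowledge, item_id, item["version_id"]):
            continue
        if target_filter(item["xml"]):
            to_send[item_id] = item
    return to_send


def apply_at_target(target_store, target_knowledge, received):
    """Store each received item and record its version-id as new knowledge."""
    for item_id, item in received.items():
        target_store[item_id] = item
        replica_id, counter = item["version_id"]
        target_knowledge.append((frozenset([item_id]), {replica_id: counter}))


laptop_store = {
    "p": {"version_id": ("A", 1), "xml": {"rating": 5}},
    "r": {"version_id": ("C", 3), "xml": {"rating": 5}},
    "s": {"version_id": ("A", 5), "xml": {"rating": 5}},
    "k": {"version_id": ("C", 2), "xml": {"rating": 3}},
}
frame_store = {}
frame_knowledge = [(frozenset("kpqr"), {"A": 4, "C": 1})]


def five_star(xml):
    return xml["rating"] == 5


to_send = handle_sync_request(laptop_store, frame_knowledge, five_star)
apply_at_target(frame_store, frame_knowledge, to_send)
print(sorted(to_send))
```

Applying items one at a time, as `apply_at_target` does, mirrors the observation that progress is preserved if the connection drops partway through a session.
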

The messages received by the target replica can be ap- 
plied to its item store individually or as a single atomic 
transaction. Updating items (and the replica’s knowl-
edge) as new versions are received allows progress to
be made even when a connection is interrupted before 
the synchronization protocol completes. The knowledge- 
driven nature of the protocol makes it resilient to device 
crashes and lost messages. 

Figure 4 illustrates a synchronization session from our
scenario in which the digital photo frame (replica B) re-
quests items from the laptop (replica C). The state shown
for each device is the metadata and item store before syn-
chronization. The arrows show the messages that are
sent during synchronization. Note that the photo frame’s
knowledge that is sent in the SyncRequest message spec-
ifies that it knows about four items, but has not seen any
updates from the laptop since version C:1. The laptop,
the source replica in this example, returns a more recent
version of item r that it produced and a new item s that
had been created at replica A. Item k had also been up-
dated on the laptop to reduce the photo’s rating; hence the
laptop notifies the photo frame that this item is no longer
of interest. The final message informs the photo frame of
the knowledge it learned from the laptop. This learned 
knowledge consists of two knowledge fragments, sepa- 
rated by a plus sign, which means that the photo frame 
will end up with three knowledge fragments after pro- 
cessing the SyncComplete message. 

The following sections describe in more detail specific 
protocol features devised to support the requirements of 
partial replication. 


5 Eventual Filter Consistency 


Although the use of item-set knowledge in the CIM Sync 
protocol guarantees that replicas eventually receive all 
items of interest (assuming sufficient effective connectiv- 
ity), it does not ensure eventual filter consistency. This 
section presents additional techniques needed to deal 
with move-outs, out-of-filter updates, and filter changes. 


5.1 Move-out notifications


During synchronization, the target replica may receive
move-out notifications from the source replica when
items have later versions that no longer match its filter.
These cause the target to remove specified items from
its item store. There are two conditions under which the
source returns move-out notifications.

Figure 4: Example synchronization between a target replica, the photo frame, and a source replica, the laptop


The simplest condition is when the source replica 
stores an item whose version is not known to the target 
replica and whose contents do not match the target’s fil- 
ter. The source can send a move-out notification for any 
such item. This is the condition illustrated in Figure 4 
where the laptop sends a move-out notification for item
k, whose rating had been reduced.


A target replica may receive move-out notifications for 
items that it does not store, such as items that are updated 
and continue to not match the target’s filter, a potentially 
common occurrence. For example, suppose that the lap-
top in Figure 4 updated item t producing version C:6 in
which the rating was unchanged but a new caption was
added to this photo. In this case, when the photo frame
next synchronizes from the laptop, it would be sent a
move-out notification for item t even though it does not
store this item and perhaps never did. Such spurious no- 
tifications do not affect eventual filter consistency since 
they will simply be ignored by the receiving replica, but 
they do consume network and processing resources. 

To avoid spurious move-out notifications, a SyncRe- 
quest message may optionally include a set of identi- 
fiers for items that are stored by the requesting replica. 
The source replica only sends move-out notifications for 
items that are in this set. Replicas cache this item set 
for their regular synchronization partners, allowing these 
partners to send deltas, that is, to send just the set of 
newly acquired items. 


Sending move-out notifications for items that are 
stored at the source replica is insufficient. Consider the 
case of a replicated customer relationship database in 
which a server holds the complete database, Bob’s lap-
top holds items for all California customers, and his cell
phone stores items for customers that live in Los An- 
geles. Bob’s cell phone synchronizes periodically with 
his laptop but never directly with the server database. 
Suppose that a customer moves from Los Angeles to 
Chicago. When Bob’s laptop synchronizes with the 
server, it receives a move-out notification causing the 
laptop to drop this customer from its local replica. But 
then how does Bob’s cell phone learn that it also should 
discard this item? 

The second condition for sending a move-out notifi- 
cation for an item is as follows: the target replica stores 
the item, the source replica does not store the item, the 
source replica’s filter is no more restrictive than the tar- 
get’s filter, and the source’s knowledge for this item is 
greater than the target’s knowledge. In other words, if 
the source is interested in all items of interest to the target 
and is more knowledgeable than the target, it can deduce 
that any items it does not store should also be removed 
from the target’s item store. This relies on the source 
being informed of the set of items that are stored by the 
target. 
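
The two conditions can be expressed as a single decision procedure at the source. The sketch below is an interpretation under stated assumptions: function names are hypothetical, XML contents are modeled as dictionaries, "knowledge for this item" is taken to be the pointwise maximum of the version vectors of all fragments covering the item, and the filter relationship is passed in as a boolean rather than derived from the filters. It also applies the optimization of only considering items that the target reports storing.

```python
def item_vector(knowledge, item_id):
    """Pointwise-max version vector over all fragments that cover item_id."""
    merged = {}
    for item_set, vector in knowledge:
        if item_set == "*" or item_id in item_set:
            for replica_id, counter in vector.items():
                merged[replica_id] = max(merged.get(replica_id, 0), counter)
    return merged


def strictly_greater(v1, v2):
    """True if v1 covers everything in v2 and at least one version more."""
    replicas = set(v1) | set(v2)
    at_least = all(v1.get(r, 0) >= v2.get(r, 0) for r in replicas)
    more = any(v1.get(r, 0) > v2.get(r, 0) for r in replicas)
    return at_least and more


def move_outs(source_store, source_knowledge, source_covers_target,
              target_filter, target_knowledge, target_item_ids):
    notices = set()
    for item_id in target_item_ids:
        item = source_store.get(item_id)
        target_vec = item_vector(target_knowledge, item_id)
        if item is not None:
            # Condition 1: stored at the source, unknown to the target,
            # and no longer matching the target's filter.
            replica_id, counter = item["version_id"]
            unknown = counter > target_vec.get(replica_id, 0)
            if unknown and not target_filter(item["xml"]):
                notices.add(item_id)
        elif source_covers_target and strictly_greater(
                item_vector(source_knowledge, item_id), target_vec):
            # Condition 2: not stored at the source, whose filter is no more
            # restrictive and whose knowledge of the item is greater.
            notices.add(item_id)
    return notices


def in_california(xml):
    return xml["state"] == "CA"


def in_los_angeles(xml):
    return xml["city"] == "Los Angeles"


# Customer c has moved to Chicago (server version S:7); d is unchanged.
server_store = {
    "c": {"version_id": ("S", 7), "xml": {"state": "IL", "city": "Chicago"}},
    "d": {"version_id": ("S", 2), "xml": {"state": "CA", "city": "San Diego"}},
}
server_knowledge = [("*", {"S": 7})]
laptop_knowledge = [(frozenset({"c", "d"}), {"S": 3})]
first = move_outs(server_store, server_knowledge, True,
                  in_california, laptop_knowledge, {"c", "d"})

# The laptop has dropped c and learned the server's knowledge of c and d.
laptop_store = {"d": server_store["d"]}
laptop_knowledge = [(frozenset({"c", "d"}), {"S": 7})]
phone_knowledge = [(frozenset({"c"}), {"S": 3})]
second = move_outs(laptop_store, laptop_knowledge, True,
                   in_los_angeles, phone_knowledge, {"c"})
print(first, second)
```

The first call exercises condition 1 (server to laptop) and the second exercises condition 2 (laptop to cell phone), following the customer database example.
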


5.2 Out-of-filter updates 


To preserve versions produced by out-of-filter updates, 
the updated items are placed in a special portion of the 
updating replica’s item store called the push-out store. 
Items in the push-out store are not visible to applications, 
but are treated like any other item during synchroniza- 
tion. In particular, such items are sent to a synchroniza- 
tion partner if they match its filter, and may be overwrit- 
ten by items received from a sync partner, possibly caus- 
ing the item to move back into the regular item store. 
Unfortunately, a replica might not have any synchro- 
nization partner whose filter matches the items in its 
push-out store. Thus, when synchronizing with any 
replica with an equal or less restrictive filter, a replica 
sends all items in its push-out store, and then optionally 
discards these items once it learns that they were success- 
fully received by the target replica. This partner accepts 
these items even if they don’t match its filter. Such items 
may end up in the target replica’s push-out store, from 
where they are passed to another replica. However, this 
could lead to situations in which two replicas play “hot 
potato” by passing back and forth an item that matches 
neither of their filters. Section 7 discusses restrictions 
that Cimbiosys places on the synchronization topology 
to avoid the hot potato problem and guarantee that out- 
of-filter updates eventually reach all interested replicas. 
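
A minimal sketch of the push-out store follows. It is illustrative only: the `PartialReplica` class and its method names are assumptions, XML contents are modeled as dictionaries, and version-ids, knowledge checks, and the discard-after-acknowledgment step are all omitted.

```python
class PartialReplica:
    """Toy replica with a regular item store and a hidden push-out store."""

    def __init__(self, matches):
        self.matches = matches      # this replica's filter predicate
        self.item_store = {}        # in-filter items, visible to applications
        self.push_out_store = {}    # out-of-filter versions awaiting hand-off

    def apply_local_update(self, item_id, xml):
        """Place the new version according to whether it matches the filter."""
        self.item_store.pop(item_id, None)
        self.push_out_store.pop(item_id, None)
        store = self.item_store if self.matches(xml) else self.push_out_store
        store[item_id] = xml

    def visible_items(self):
        return dict(self.item_store)

    def items_to_offer(self, partner_matches, partner_filter_is_no_more_restrictive):
        """Items a sync partner should receive from this replica."""
        everything = {**self.item_store, **self.push_out_store}
        offered = {i: x for i, x in everything.items() if partner_matches(x)}
        if partner_filter_is_no_more_restrictive:
            # Such a partner accepts push-out items even if they miss its filter.
            offered.update(self.push_out_store)
        return offered


laptop = PartialReplica(lambda xml: "family" in xml["keywords"])
laptop.apply_local_update("p", {"keywords": {"family"}})
laptop.apply_local_update("p", {"keywords": set()})  # Alice removes the tag

print(laptop.visible_items())
print(sorted(laptop.items_to_offer(lambda xml: True, True)))
```
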


5.3 Changing filters


Cimbiosys permits arbitrary filter changes while allow- 
ing replicas to retain as many items as possible. When a 
replica changes its filter it may need to discard items or 
knowledge or both depending on the nature of the filter 
change. If the new filter is more restrictive than the pre-
vious filter, that is, if it matches fewer items, then items
that no longer match the filter are moved to the replica’s 
push-out store. The replica cannot simply discard such 
items since it may be the only replica that holds the latest 
versions. As discussed above, items from the replica’s 
push-out store will eventually be discarded after they are 
passed to another replica (or it is determined that they 
are already stored by another replica). Although some 
in-filter versions may become out-of-filter versions, the 
replica’s knowledge does not change. 


If the new filter is less restrictive than the previous fil- 
ter, then previously out-of-filter versions may now match 
the new filter. Such versions need to be removed from 
the replica’s knowledge so that the replica will receive 
them during future synchronizations. Unfortunately, the 
replica cannot determine which versions in its knowl- 
edge are out-of-filter and which are obsolete. So, con- 
servatively, its knowledge must be retracted to include 
only versions of items that it already stores. The repre- 
sentation of item-set knowledge makes retraction easy. 
Knowledge fragments with explicit item-sets retain the 
same version vector but with a possibly smaller set of
items; any star-knowledge fragments are converted to 
item-set knowledge. 
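
Retraction amounts to intersecting every fragment's item-set with the set of stored items. The sketch below assumes fragments are encoded as (item-set, version-vector) pairs with "*" denoting star-knowledge; the `retract` name and the example values are hypothetical.

```python
def retract(knowledge, stored_item_ids):
    """Restrict knowledge to versions of items that the replica actually stores."""
    stored = frozenset(stored_item_ids)
    retracted = []
    for item_set, vector in knowledge:
        # Star-knowledge becomes item-set knowledge over the stored items;
        # explicit item-sets shrink but keep their version vector.
        kept = stored if item_set == "*" else frozenset(item_set) & stored
        if kept:
            retracted.append((kept, dict(vector)))
    return retracted


knowledge = [
    ("*", {"A": 9}),
    (frozenset({"k", "r", "t", "u"}), {"A": 7, "C": 5}),
]
after = retract(knowledge, {"k", "r"})
print(after)
```

After retraction the replica no longer claims to know any version of an item it does not store, so previously out-of-filter versions that match the broader filter will be offered again by its partners.
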


If the new filter is neither less restrictive nor more 
restrictive than the previous filter, that is, if the old 
and new filters are incomparable, then both cases apply. 
The replica may need to move non-matching items to 
its push-out store. The replica also needs to retract its 
knowledge. 


Since replicas are allowed to change their filters at any 
time, a replica may receive out-of-date move-out notifi- 
cations based on a previous filter. To guard against pro- 
cessing out-of-date notifications, a replica increments a 
counter whenever it updates its filter. Essentially, this 
counter serves as a version identifier for the replica’s fil- 
ter. The filter version number is included in each syn- 
chronization request and is returned in each move-out 
notification. Move-out notifications that include old fil- 
ter version numbers are simply ignored by the receiving 
replica. 
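
The filter version check is small enough to show in full. This is a sketch with hypothetical class and field names; the message format shown is an assumption, not the Cimbiosys wire format.

```python
class TargetReplica:
    """Toy target replica that tags sync requests with its filter version."""

    def __init__(self):
        self.filter_version = 0
        self.item_store = {}

    def change_filter(self):
        # The counter acts as a version identifier for the filter.
        self.filter_version += 1

    def build_sync_request(self):
        return {"filter_version": self.filter_version}

    def on_move_out(self, item_id, filter_version):
        """Apply a move-out notification unless it was computed for an old filter."""
        if filter_version != self.filter_version:
            return False  # out-of-date notification: ignore it
        self.item_store.pop(item_id, None)
        return True


frame = TargetReplica()
frame.item_store["k"] = "photo k"
old_version = frame.build_sync_request()["filter_version"]
frame.change_filter()

stale_applied = frame.on_move_out("k", old_version)
still_stored = "k" in frame.item_store
fresh_applied = frame.on_move_out("k", frame.filter_version)
print(stale_applied, still_stored, fresh_applied)
```
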


6 Eventual Knowledge Singularity 


In this section, we propose mechanisms by which repli- 
cas acquire and compact their knowledge. Although 
the number of fragments in a replica’s knowledge may 
temporarily grow after synchronization, the knowledge 
tends to converge towards a single star-knowledge frag- 
ment represented as a single version vector. This section 
shows how we achieve the desired state of knowledge 
singularity for both full and partial replicas. 


6.1 Acquiring knowledge


As replicas receive items during synchronization, they 
add the items’ version-ids to their knowledge, but re- 
quire some other means of learning about obsolete and 
out-of-filter versions. The SyncComplete message at the 
end of the synchronization protocol conveys knowledge 
that the target replica learned during this sync session. 
The target replica adds this learned knowledge to its own 
knowledge, generally as new knowledge fragments. This 
knowledge can include any version-ids for items cur- 
rently stored by the source replica as well as any ids for 
versions that the source knows to be obsolete. It may 
not, however, include versions that are out-of-filter at the 
source replica but could match the target replica’s filter 
as this would cause the target replica to fail to receive 
such versions from other replicas. 

The learned knowledge, therefore, depends on the re- 
lationship between the filters of the synchronizing repli- 
cas. If the source replica’s filter is no more restrictive 
than the target’s filter, that is, if any item that matches 
the target’s filter also matches the source’s filter, then 
the source replica can send its complete knowledge in 
the SyncComplete message; any out-of-filter versions in- 
cluded in the source’s knowledge will also be out-of- 
filter with respect to the target replica. In other cases 
in which the target has a broader filter or a disjoint filter 
compared to the source, the source replica must restrict 
the conveyed learned knowledge to those items that it 
actually stores. Figure 4 shows an example of disjoint 
filters; the photo frame’s filter is based on the rating at- 
tribute and the laptop’s filter is based on the value of the 
photo’s keyword (in this case, “family”). 
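
The rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: filters are modelled as predicates over items, the "no more restrictive" test is done by brute force over a small finite universe, and the names `no_more_restrictive`, `learned_knowledge`, and `stored_or_obsolete` are hypothetical rather than taken from the Cimbiosys implementation.

```python
from typing import Callable, Dict, List, Set

Item = Dict[str, object]          # hypothetical item: attribute name -> value
Filter = Callable[[Item], bool]   # a filter is modelled as a predicate


def no_more_restrictive(source: Filter, target: Filter,
                        universe: List[Item]) -> bool:
    """True if every item matching the target's filter also matches the
    source's filter. Brute force over a finite universe, for illustration."""
    return all(source(item) for item in universe if target(item))


def learned_knowledge(source_knowledge: Set[str],
                      stored_or_obsolete: Set[str],
                      source_filter: Filter,
                      target_filter: Filter,
                      universe: List[Item]) -> Set[str]:
    """Choose the version-ids conveyed in the SyncComplete message."""
    if no_more_restrictive(source_filter, target_filter, universe):
        # Anything out-of-filter at the source is out-of-filter at the target.
        return set(source_knowledge)
    # Broader or disjoint target filter: only versions the source stores
    # or knows to be obsolete may be conveyed.
    return set(stored_or_obsolete)
```

With a photo frame filtering on rating and a laptop filtering on keyword, the two filters are disjoint in this sense, so the laptop would convey only the restricted set.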


6.2 Compacting knowledge 


Whenever a replica synchronizes with another replica, it 
receives new knowledge fragments. To reduce the num- 
ber of fragments in its knowledge and the overall size, a 
replica can compact its knowledge using a set of simple 
rules. For example, suppose the replica’s knowledge 
includes two fragments, S1:V1 and S2:V2. If the set S1 is 
a subset of set S2 and the version vector V2 dominates 
V1 (i.e., any versions in V1 are also included in V2), then 
the fragment S1:V1 is redundant and can be discarded. 
If V1 and V2 are identical, then the sets S1 and S2 can 
be combined into a single knowledge fragment. Table 1 
enumerates compaction rules that can be applied to any 
pair of knowledge fragments. 
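
As a rough illustration of the two rules just described, the following Python sketch compacts a pair of fragments. It assumes item-sets are finite sets of item names and version vectors are dictionaries mapping a replica name to its highest known counter, with zero entries omitted; `dominates` and `compact_pair` are hypothetical names, and only the two rules from the paragraph above are covered.

```python
from typing import Dict, FrozenSet, List, Tuple

VersionVector = Dict[str, int]                # replica name -> highest counter
Fragment = Tuple[FrozenSet[str], VersionVector]


def dominates(a: VersionVector, b: VersionVector) -> bool:
    """True if every version covered by b is also covered by a."""
    return all(a.get(replica, 0) >= counter for replica, counter in b.items())


def compact_pair(f1: Fragment, f2: Fragment) -> List[Fragment]:
    """Apply the subset/dominance rule and the identical-vector rule."""
    (s1, v1), (s2, v2) = f1, f2
    if v1 == v2:
        # Identical version vectors: merge the item-sets.
        return [(s1 | s2, v1)]
    if s1 <= s2 and dominates(v2, v1):
        # S1:V1 is redundant.
        return [f2]
    if s2 <= s1 and dominates(v1, v2):
        # S2:V2 is redundant.
        return [f1]
    # Neither rule applies; both fragments are kept.
    return [f1, f2]
```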

While these knowledge compaction rules are effective, 
they don’t always lead to compact knowledge in practice. 
Consider the case of Alice who edits photo r on her 
laptop (replica C) producing a new version with version-id 
C:1, then edits this same photo again to produce a newer 
version C:2. Alice also adds keywords to photos t, u, and 
k, producing versions C:3, C:4, and C:5. Suppose that 


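S1:V1 + S2:V2 →  | S1 ⊂ S2                   | S1 = S2      | S1 ⊃ S2                   | otherwise
-----------------+---------------------------+--------------+---------------------------+-------------
V1 ⊂ V2          | S2:V2                     | S2:V2        | S2:V2 + (S1−S2):V1        |
V1 = V2          | (S1∪S2):V1                | (S1∪S2):V1   | (S1∪S2):V1                | (S1∪S2):V1
V1 ⊃ V2          | S1:V1 + (S2−S1):V2        | S1:V1        | S1:V1                     |
otherwise        | S1:(V1∪V2) + (S2−S1):V2   | S1:(V1∪V2)   | S2:(V1∪V2) + (S1−S2):V1   |
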

Table 1: Knowledge compaction rules 


these items all match replica C’s filter and are never 
updated by other replicas. The state of replica C on Alice’s 
laptop is as shown in Figure 4. When Alice’s home PC 
(replica A) synchronizes from her laptop, it will receive 
these items and the associated learned knowledge. The 
home PC’s knowledge would become something similar 
to *:<A:9> + {k, r, t, u}:<A:7, C:5>. Unfortunately, 
this knowledge cannot be compacted, because neither 
version vector dominates the other. This problem is 
addressed in the remainder of this section. 


6.3 Authoritative versions 


Key to reducing the number of fragments in a replica’s 
knowledge is the notion of authority. A replica is author- 
itative for a version of an item if it either stores the item 
or knows the item to be obsolete. Recall from Section 6.1 
that version-ids for any stored or obsolete versions can 
be included in the learned knowledge acquired by a tar- 
get replica at the completion of the synchronization pro- 
cess. The source replica, therefore, can return a learned 
knowledge fragment in which the item-set is “*” (i.e., all 
items in the collection) and the associated version vector 
includes identifiers for its authoritative versions. In other 
words, during synchronization, the target replica learns 
of any versions of any items for which the source replica 
is authoritative. Moreover, when the target replica’s filter 
is equal to or less restrictive than the source’s filter, the 
target replica becomes an authority for all of the source 
replica’s authoritative versions. 


In our previous example, the laptop (replica C) is 
authoritative for all of the versions that it produced, that 
is, for versions C:1 through C:5. Thus, replica C sends 
*:<C:5> as learned knowledge when synchronizing to 
any other replica. This knowledge fragment is merged 
into the receiving replica’s star-knowledge, and hence 
does not lead to an increase in the overall number of 
knowledge fragments. A replica’s star-knowledge grows 
so that it eventually dominates other knowledge frag- 
ments, which can then be discarded using the compaction 
rules in Table 1. 
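
To make the effect concrete, here is a small Python sketch of merging a learned star-knowledge fragment into a replica's existing star-knowledge and then checking whether another fragment has become redundant. It assumes version vectors are dictionaries from replica name to highest counter; `merge_star` and `dominates` are hypothetical helper names, not part of Cimbiosys.

```python
from typing import Dict

VersionVector = Dict[str, int]   # replica name -> highest known counter


def merge_star(star: VersionVector, learned: VersionVector) -> VersionVector:
    """Merge a learned star-knowledge vector into existing star-knowledge
    by taking the element-wise maximum; no new fragment is created."""
    merged = dict(star)
    for replica, counter in learned.items():
        merged[replica] = max(merged.get(replica, 0), counter)
    return merged


def dominates(a: VersionVector, b: VersionVector) -> bool:
    """True if every version covered by b is also covered by a."""
    return all(a.get(replica, 0) >= counter for replica, counter in b.items())


# The home PC from the earlier example: star-knowledge <A:9> plus the
# fragment {k, r, t, u}:<A:7, C:5> that could not be compacted.
star = merge_star({"A": 9}, {"C": 5})       # laptop sends *:<C:5>
fragment_vector = {"A": 7, "C": 5}
can_discard_fragment = dominates(star, fragment_vector)
```

Once the star-knowledge covers C:5, it dominates the item-set fragment, so the subset/dominance rule lets the replica drop that fragment.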
6.4 Transferring authority 


One practical issue remains, namely how to transfer au- 
thority when an item is no longer of interest to the author- 
itative replica, whether due to out-of-filter updates or to 
filter changes. Such operations cause items to be placed 
in a replica’s push-out store. The replica will cease to be 
authoritative for its own versions that are pushed to an- 
other replica and then discarded. Requiring a replica to 
store indefinitely all of the items that it creates or updates 
would be unreasonable. For instance, a digital camera 
often offloads its photos to a laptop in order to free up 
storage space for new photos. In practice, the system 
simply needs to maintain the invariant that there exists 
at least one replica that is authoritative for every version 
ever generated. 

In Cimbiosys, when a replica sends the items in its 
push-out store to a replica with a less restrictive filter, the 
receiving replica becomes authoritative for these items. 
The sending replica can then discard such items without 
violating the system-wide invariant. Each replica records 
the version-id of the most recent version it has generated 
for which it is no longer authoritative. The replica then 
knows that it is authoritative for any versions it has pro- 
duced with greater version-ids. The learned knowledge 
sent by a replica is a star-knowledge fragment containing 
the range of version-ids from the first version generated 
after its last push-out to its most recently generated ver- 
sion. A replica that has received multiple star-knowledge 
fragments containing overlapping or contiguous version 
ranges can combine these together into a single fragment. 

For example, suppose Alice’s laptop (replica C) 
changes its filter so that it no longer wants items with 
ratings below three. Version C:5 of item k no longer 
matches. After pushing this item to Alice’s home PC 
(replica A), as well as sending the latest versions of all 
other items, the home PC will have learned *:<C:5>. At 
this point, the laptop discards item k and records C:5 as 
its last unauthoritative version. Now, suppose that Alice 
performs three more updates from her laptop, producing 
versions with identifiers C:6, C:7, and C:8. During 
synchronization to another replica, say Alice’s photo 
frame (replica B), the laptop will pass *:<C:6..C:8> as 
learned knowledge. When the photo frame synchronizes 
from the home PC, it will receive learned knowledge of 
*:<C:5> in addition to knowledge of other versions for 
which Alice’s home PC is authoritative. The photo frame 
then combines the knowledge received from the laptop 
with that received from the home PC to get a knowledge 
fragment of *:<C:8>, which in turn is merged with its 
other star-knowledge. 
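
The range bookkeeping in this example can be sketched as follows. This is a minimal illustration, assuming that star-knowledge for one originating replica is kept as a list of inclusive counter ranges; `authoritative_range` and `merge_ranges` are hypothetical names introduced only for this sketch.

```python
from typing import List, Tuple

Range = Tuple[int, int]   # inclusive range of version counters for one replica


def authoritative_range(last_unauthoritative: int, latest: int) -> Range:
    """Range a replica sends as learned star-knowledge: from the first
    version generated after its last push-out to its latest version."""
    return (last_unauthoritative + 1, latest)


def merge_ranges(ranges: List[Range]) -> List[Range]:
    """Combine overlapping or contiguous ranges into as few as possible."""
    merged: List[Range] = []
    for lo, hi in sorted(ranges):
        if merged and lo <= merged[-1][1] + 1:
            # Overlaps or directly follows the previous range: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], hi))
        else:
            merged.append((lo, hi))
    return merged


# Photo frame: learns C:6..C:8 from the laptop and C:1..C:5 from the home PC.
from_laptop = authoritative_range(5, 8)
combined = merge_ranges([from_laptop, (1, 5)])
```

Here the two ranges are contiguous and collapse into a single range starting at 1, which corresponds to the ordinary version-vector entry for replica C.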

As a replica synchronizes from other replicas, it ac- 
quires star-knowledge fragments from each of these sync 
partners. Such fragments are combined together into a 
single star-knowledge fragment that is monotonically 
increasing (provided the replica does not expand its filter). 
As long as each replica regularly synchronizes with a set 
of partners that collectively know about all versions in 
the system, each replica will converge towards singu- 
lar knowledge. Clearly, a device that synchronizes di- 
rectly with every other device will receive a complete set 
of star-knowledge. The following section describes how 
Cimbiosys ensures that replicas are configured in a suit- 
able topology without requiring full interconnectivity. 


7 Filter-based Tree Topologies 


The CIM Sync protocol can be used by any set of repli- 
cas with arbitrary filters and arbitrary synchronization 
patterns. When a replica synchronizes with any other 
replica, it will receive all versions stored by its partner 
that match its filter, and it will receive whatever move- 
out notifications can be generated by the partner. More- 
over, a replica never receives the same version from mul- 
tiple synchronization partners (unless it engages in paral- 
lel synchronizations or changes its filter). But additional 
constraints must be placed on the synchronization topol- 
ogy in order to achieve eventual filter consistency and 
eventual knowledge singularity. 

Cimbiosys forces replicas of a given collection to con- 
figure themselves into a hierarchically filtered tree topol- 
ogy. In particular, each replica has a single parent replica, 
except for the replica at the root of the tree, and a 
replica’s filter must be at least as restrictive as that of its 
parent. In other words, a parent replica stores any items 
that are stored by any of its children. The replica at the 
root of the tree has a filter that matches all items; that is, 
it stores a full copy of the collection. This root replica is 
called the reference replica for the collection. Parent and 
child replicas are required to perform synchronization in 
both directions, at least occasionally, but may also syn- 
chronize with other replicas. 

Constructing the tree is easy. When a new replica is 
created for a collection, it asks an existing replica to be its 
parent. If the filter of the requested parent is too restric- 
tive, then the new replica walks up the existing tree until 
it finds a replica that can serve as its parent. At the very 
least, the reference replica can always serve as a parent 
for any replica with an arbitrary filter. If a replica wishes 
to retire gracefully from a collection, then this replica 
should notify its children so they can select a new parent. 
The retiring replica’s parent, for instance, can serve as 
the new parent for its children, or, in some cases, one of 
the existing children can be promoted to be the parent of 
its siblings. A replica can change its parent at any time 
as long as it chooses a new parent with a suitable filter 
and does not violate the tree structure. For instance, a 
replica may be required to find a new parent when it ex- 
pands its filter or its previous parent is unreachable for 
an extended period of time. 
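
The parent-selection walk described above might look like the following Python sketch. It assumes, purely for illustration, that a filter can be represented by the finite set of items it accepts, so that "at least as restrictive" becomes a subset test; the `Replica` class and `find_parent` function are hypothetical.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass
class Replica:
    name: str
    matches: FrozenSet[str]              # items the replica's filter accepts
    parent: Optional["Replica"] = None   # None for the reference replica


def find_parent(requested: Replica, new_matches: FrozenSet[str]) -> Replica:
    """Walk up the tree from the requested parent until reaching a replica
    whose filter is no more restrictive than the new replica's filter."""
    node = requested
    while not new_matches <= node.matches:
        if node.parent is None:
            # Should not happen when the root's filter matches all items.
            raise ValueError("reference replica does not cover the filter")
        node = node.parent
    return node


root = Replica("home-pc", frozenset({"a", "b", "c", "d"}))
laptop = Replica("laptop", frozenset({"a", "b"}), parent=root)
camera = Replica("camera", frozenset({"a"}), parent=laptop)
chosen = find_parent(camera, frozenset({"a", "b"}))
```

In this toy tree, a new replica interested in items a and b that first asks the camera ends up with the laptop as its parent, while one interested in item c would be referred all the way up to the reference replica.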
The tree synchronization topology provides four 
important benefits. 

One, the synchronization topology ensures effective 
connectivity. That is, groups of replicas for the same col- 
lection cannot remain disconnected indefinitely, assum- 
ing periodic synchronization between parents and chil- 
dren. Moreover, each version of an item has a guaranteed 
path by which it can travel from the originating replica 
to any other replica whose filter matches the version. 
Specifically, when a new version is created, it can flow 
up the tree from child to parent replicas until it reaches 
common ancestors, including the reference replica. Any 
versions held by the reference replica can flow to any 
other replica over a path of replicas with increasingly re- 
strictive filters. 

Two, move-out notifications can be delivered by a 
parent to any of its children. Recall from Section 5 
that move-out notifications can be sent when the source 
replica has a filter that is no more restrictive than the 
target. This is exactly the case for replicas with a parent- 
child relationship. Thus, the tree topology guarantees 
that all replicas are able to receive appropriate move-out 
notifications. Essentially, such notifications flow down 
the tree. 

Three, out-of-filter versions in a replica’s push-out 
store flow up the tree until they reach replicas that are 
interested in those items. During synchronization from a 
child replica to its parent, the child sends all of the items 
in its push-out store, regardless of whether they match 
the parent’s filter. The tree topology prevents replicas 
from playing “hot potato” with out-of-filter versions. 

Four, the tree topology ensures eventual knowledge 
singularity. As authoritative versions are passed up the 
tree, a parent replica assumes authority for any versions 
generated by any of its children or their descendants. 
Eventually, all authoritative versions arrive at the ref- 
erence replica, which produces a single star-knowledge 
fragment containing all of these versions. This star- 
knowledge fragment is then passed down the tree from 
the reference replica to all other replicas during parent- 
to-child synchronizations. In the absence of further up- 
dates or filter changes, each replica’s knowledge will 
eventually converge to that of the reference replica. 

Although these benefits argue convincingly for hav- 
ing a tree-structured synchronization topology, extended 
synchronization patterns are not prevented. In Cim- 
biosys, a replica can choose arbitrary synchronization 
partners (in addition to its parent and children). The only 
restriction is that the overall synchronization topology 
must include an embedded tree with a reference replica. 

All practical usage scenarios that we’ve envisioned 
meet this condition. In the photo sharing scenario 
presented in Section 2, Alice’s home PC serves as the 
reference replica for her photo collection. Her laptop and 
digital photo frame synchronize directly with this PC, and 
treat it as their parent, as do the cloud-based services that 
contain selected photos. However, Alice’s laptop might 
also sync with such services on occasion or sync directly 
with friends’ laptops. Cloud-based services might repli- 
cate data among themselves for geographic scaling, un- 
beknownst to the reference replica. The digital camera, 
which only synchronizes with the laptop, uses the lap- 
top as its parent replica. The overlaid tree topology en- 
sures that Alice’s new photos will eventually find their 
way into her master photo collection as well as onto other 
devices with selective filters. 


8 Evaluation 


In this section, we present an evaluation of Cimbiosys 
based on our two implementations, one in C# for Win- 
dows platforms and one in Mace for Linux platforms. 
In particular, we answer the following questions with re- 
spect to the goals of Cimbiosys: 


• Does Cimbiosys achieve eventual filter consistency 
in the presence of move-outs, out-of-filter updates, 
and changing filters? 


• How concise is the knowledge representation in 
Cimbiosys as compared to protocols with per-item 
knowledge, and does the reduction in knowledge 
size lead to more efficient synchronizations? 


• What are the benefits of leveraging filter 
relationships between replicas, and how do 
non-hierarchical synchronizations affect the 
performance of Cimbiosys? 


8.1 Experiments on the C# implementation 


We performed experiments on the C# implementation by 
running 10 replicas on the same computer. The replicas 
formed a three-level hierarchy based on filter relation- 
ships with one full replica at the top, three partial replicas 
in the middle, and six more partial replicas at the bottom. 
Each replica’s filter was less restrictive than the filters of 
any replica at a lower level. 

The experimental workload had five serial phases con- 
sisting of different kinds of updates to the system. Each 
update consisted of a randomly chosen replica modifying 
the content of a randomly chosen item in its item store. 
Throughout the experiment, replicas synchronized with 
randomly chosen partners at regular intervals. 


1. insert phase: Randomly chosen replicas inserted a 
total of 1000 items into their respective item stores 
at the start of the experiment. 600 synchronizations 
followed the inserts. 


2. update phase: 1000 updates were performed, none 
of which triggered move-outs at any replicas. There 
were 600 synchronizations in this phase, and the 
updates happened during the start of the phase at the 
rate of 10 updates between each synchronization. 


Figure 5: Average inconsistent items per replica vs. time 


3. move-out phase: Replicas updated 100 items; the 
updated content continued to match the updater’s 
filter even though it might move out of other repli- 
cas’ filters. 600 synchronizations followed. 


4. push-out phase: Replicas performed a total of 50 
out-of-filter updates. That is, the updated content 
did not match the updating replica’s filter. Another 
600 synchronizations followed. 


5. filter-change phase: Three randomly chosen par- 
tial replicas changed their filters to new non- 
overlapping filters. A final 300 synchronizations 
ended the experiment. 


We evaluated two variants of the Cimbiosys system. 
The first variant, called CIM-Basic, implemented all the 
core mechanisms described in Section 5 for achieving 
eventual filter consistency. The second variant, called 
CIM-Singular, implemented the additional mechanisms 
for the accumulation of authoritative knowledge in order 
to achieve eventual knowledge singularity as presented 
in Section 6. 


Results 


We first show the progress made by replicas in achieving 
eventual filter consistency. Figure 5 plots the average 
number of inconsistencies in a replica’s item store over 
time. Here, an inconsistency at a replica R at a certain 
time includes three cases: a) an item present in R’s store 
is obsolete, b) the latest version of an item matches R’s 
filter but no version of the item is present in R’s store, 
and c) an item is present in R’s store but does not match 
R’s filter. We counted these inconsistencies by tracking 
the global state of the system. 
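
The three cases can be turned into a small counting routine. The Python sketch below is only one possible reading of the metric: it assumes the global state is a map from item to its latest version-id and content, that a replica's store maps items to the version-id it holds, and that each item is counted at most once. `count_inconsistencies` is a hypothetical name, not the code used in the experiments.

```python
from typing import Callable, Dict, Tuple

Content = Dict[str, object]
Latest = Dict[str, Tuple[str, Content]]   # item -> (latest version-id, content)
Store = Dict[str, str]                    # item -> version-id held by replica


def count_inconsistencies(store: Store, latest: Latest,
                          matches: Callable[[Content], bool]) -> int:
    """Count items in an inconsistent state at one replica."""
    count = 0
    for item, (version, content) in latest.items():
        held = store.get(item)
        if held is None:
            if matches(content):
                count += 1      # case (b): should be stored but is missing
        elif held != version:
            count += 1          # case (a): stored version is obsolete
        elif not matches(content):
            count += 1          # case (c): stored but no longer matches
    return count
```

Averaging this count over all replicas at each sampling instant would give a curve of the kind plotted in Figure 5, under the assumptions stated above.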

Figure 5 confirms that both CIM-Basic and CIM-Singular 
eventually achieve a state of zero inconsistencies 
in the presence of partial synchronization, move-outs, 
out-of-filter updates, and filter changes. They also 
converge at the same rate (and the graphs are identical) 
because they share the same core mechanisms to support 
partial replication. 

Figure 6: Average size of knowledge per replica vs. time 

Figure 7: Cumulative synchronization overhead incurred 
vs. time 

We next evaluate knowledge compaction in Cim- 
biosys. Figure 6 shows the average size of the knowl- 
edge of each replica over time. As expected, the size of 
knowledge in CIM-Basic increases as updates are per- 
formed and reaches a peak value dependent on the num- 
ber of items stored in the replica and the number of up- 
dates performed to each item. In CIM-Singular, however, 
knowledge is fragmented in the initial stages but eventu- 
ally converges to the size of a single version vector at 
the end of each phase. In other words, CIM-Singular 
achieves eventual knowledge singularity. 

Figure 7 demonstrates the positive effect that knowledge 
compaction has on synchronization overhead. It 
shows the cumulative overhead incurred during synchro- 
nizations in the insert and the update phases. The over- 
head includes the cost of transmitting knowledge from 
the target to the source in the initial SyncRequest mes- 
sage and from the source to the target in the final Sync- 
Complete message. 
Figure 8: Effect of leveraging filter relationships 


Knowledge compaction provides a significant reduc- 
tion in the sync overhead over a period of time as evi- 
dent from the difference between CIM-Basic and CIM- 
Singular in the figure. Low synchronization overhead 
means that replicas can synchronize more often and 
learn updates sooner with the same bandwidth budget. 
It also enables effective synchronization for replicas on 
bandwidth-constrained mobile devices. 


8.2 Experiments on the Mace implementation 


We evaluated the Mace implementation of Cimbiosys 
using ModelNet [21] to simulate a variety of network 
topologies on a cluster of machines. 

For these experiments, we used a system of 10 repli- 
cas, a binary-tree filter hierarchy, and a collection size of 
10,000 items, which reflects the average size of a con- 
sumer photo collection. Using ModelNet, we emulated a 
clique of 10 routers, each connected to a single replica. 
The link speed between all routers and replicas was set 
to 100 Mbps. The trends in the experimental results were 
similar with lower bandwidths. 

Each experiment consisted of two phases. During 
phase 1, replicas created items such that 10,000 total 
items existed in the system at the conclusion of this 
phase. During phase 2, synchronizations proceeded until 
the knowledge at all replicas converged to a stable state. 


Results 


The general trends in the size of knowledge and the sync 
overhead for the Mace experiments were similar to the 
results of the C# experiments discussed earlier, and so 
we do not present them here. Instead, we focus on evalu- 
ating the impacts of filter relationships and synchroniza- 
tion patterns. 

We first discuss the effects of leveraging the hier- 
archical filter relationships overlaid upon the network 
topology. We performed experiments where each replica 
chose a parent or a child as its synchronization partner 
50% of the time and an arbitrary replica at other times. 
In the first experiment, called hierarchy, replicas would 
synchronize as parents or children when their filters were 
in the proper relation according to the filter hierarchy. In 
the second experiment, called no hierarchy, every 
synchronization was treated as if the filters were unrelated. 

Figure 9: Effects of out-of-hierarchy synchronization 

Figure 8 shows the benefits of leveraging parent- 
child relationships between replicas. Replicas can ac- 
cept knowledge from their parents and can then directly 
merge this knowledge with their own, as they know after 
synchronizing with a parent that all versions included in 
the parent’s knowledge should be included in their own. 
Similarly, replicas can become authoritative for versions 
authored by their descendants, and this information can 
flow up the hierarchy until it reaches a reference replica, 
at which point it flows downward in a compact form. 
Without a hierarchy, replicas can only claim authority 
over versions they themselves store. We can still achieve 
eventual knowledge singularity without a filter hierarchy 
but it takes longer for replicas to reach that state. 

Finally, we discuss how the choice of synchronization 
partners (only parent or children versus arbitrary repli- 
cas) affects the performance of Cimbiosys. Figure 9 
compares an experiment in which replicas only synchro- 
nized with their parents and children with an experiment 
in which the replicas selected synchronization peers at 
random. As the figure shows, restricting synchroniza- 
tions to parents and children allows knowledge to con- 
verge much more quickly. This is because knowledge 
tends to flow within a hierarchy in a more compact form. 
On the other hand, synchronizations with arbitrary peers 
may allow quicker exchange of updated items between 
replicas at the cost of increased fragmentation in knowl- 
edge. 


9 Related Work 


The Cimbiosys design presented in this paper builds 
upon previous work on content-based filtering and es- 
pecially weak-consistency replication protocols. In this 
section, we discuss related work with an eye toward how 
the systems fall short of meeting the challenges 
introduced by content-based replication with a peer-to-peer 
synchronization model, particularly in an environment 
characterized by changing content, user interests, and 
device connectivity. 


System      | Selection criteria                   | Partial sync             | Effective connectivity                    | Move-outs                       | Out-of-filter updates         | Filter changes
------------+--------------------------------------+--------------------------+-------------------------------------------+---------------------------------+-------------------------------+----------------------------------------
Cimbiosys   | Content-based filters                | Item-set knowledge       | Filter-constrained embedded tree topology | Explicit move-out notifications | Push-out store                | Knowledge retraction and push-out store
Ficus       | File IDs                             | Metadata exchange        | Per-file ring topology                    | Cannot occur                    | Cannot occur                  | Not addressed
PRACTI      | File IDs / directories               | Log exchange             | Policy                                    | Not addressed                   | Not addressed                 | Not addressed
EnsemBlue   | File IDs + persistent queries        | Client-server            | Client-server                             | Not addressed                   | Write back to server          | Not addressed
Perspective | Views, i.e. attribute-based filters  | Log or metadata exchange | Not addressed                             | Logged pre and post versions    | Retain until pulled by device | Not addressed


Table 2: Key design decisions in Cimbiosys and related work. 

The HomeViews system has the similar goal of supporting 
selective data sharing in a peer-to-peer system 
model [6]. It allows users to export their data, includ- 
ing digital photos and other files, as views defined by 
content-based queries written in SQL. Although views 
are essentially equivalent to filters in Cimbiosys, they are 
defined by the data exporter rather than by the devices 
that import the data. Moreover, data is not replicated 
among devices but rather views are accessed remotely 
and searched via distributed queries. 


The filters supported in Cimbiosys also resemble those 
of content-based publish/subscribe systems, though such 
systems offer a completely different replication model [1, 
4]. Subscribers in a pub/sub system advertise their fil- 
ters to a collection of brokers, which build routing tables 
used to route events from a publisher to the set of inter- 
ested subscribers. Each event is independent and stored 
temporarily in the brokers’ message queues. New sub- 
scribers (or those with new filters) observe only future 
events. In Cimbiosys, on the other hand, replicas even- 
tually and persistently store all items that match their fil- 
ters, can update items, and disseminate new and updated 
items among themselves through direct communication. 


Some systems support partial replication but with a 
client-server model. Coda, for instance, allows clients 
to cache some or all of the files residing on a server, 
thereby supporting disconnected operation on mobile de- 
vices [9]. A hoard profile, which could be considered a 
type of filter, specifies the files of interest to each client, 
though Coda clients may cache other files based on ac- 
cess patterns. Clients reconcile their local changes di- 
rectly with the server(s). BlueFS [14] provides a simi- 
lar system model but emphasizes energy efficiency when 
dealing with small, mobile devices. As opposed to Cim- 
biosys, neither Coda nor BlueFS permits clients to share 
updates directly with each other. 
EnsemBlue [17] extends BlueFS by allowing disconnected 
clients to organize into a temporary ensemble 
headed by a client acting in place of the server. No- 
tably, EnsemBlue supports persistent queries that can be 
used by clients, along with server-provided callbacks for 
cache invalidation, to provide a form of content-based 
replication. Select operations on files that match a per- 
sistent query are logged by the server in a special file 
that can be retrieved and read by clients. A client then 
explicitly fetches new files that match its query and dis- 
cards updated files that no longer match the query. Un- 
like Cimbiosys, the burden is placed on servers to record 
which files are cached where and on clients to fetch up- 
dated files in order to determine whether the contents are 
of interest. 


Some topology-independent replication systems allow 
arbitrary communication patterns but lack support for 
content-based filters. Bayou, for instance, includes an 
efficient log-based, peer-to-peer synchronization proto- 
col but assumes that all replicas are interested in all 
items [18]. WinFS, like Bayou, maintains a single ver- 
sion vector per replica that is transmitted on every syn- 
chronization, but uses state-exchange rather than log- 
exchange [15]. WinFS supports replication of arbitrary 
file folders but not per-replica filters. Cimbiosys ex- 
tends the WinFS design to support content-based filter- 
ing while ensuring eventual filter consistency; the even- 
tual knowledge singularity property ensures that the per- 
replica overhead converges to a single version vector as 
in Bayou and WinFS. 


A few other systems have combined topology inde- 
pendence with some form of partial replication. One 
early peer-to-peer replication system, Ficus [7], was ex- 
tended to support selective replication [19]. Each replica 
can store an arbitrary subset of a file system volume and 
can alter the set of locally stored files at any time. Be- 
cause the set of interesting files is explicitly specified by 
file ids, and not based on file contents, several of the key 
concerns with content-based filtering do not arise in 
Ficus, including out-of-filter updates and move-outs. 
Synchronization is a heavyweight operation since a replica 
must pull information about all of the files stored on a 
remote replica in order to determine those that have been 
updated or newly created. To reduce communication 
costs and ensure effective connectivity, the sites replicat- 
ing a given file are organized into a ring where synchro- 
nizations occur between neighbors in the ring, essentially 
renouncing topology-independence. 

PRACTI is another replication system with topology- 
independence and partial replication (and arbitrary con- 
sistency) [2]. In PRACTI, each replica maintains a log 
of invalidations for objects that have been updated. A 
synchronization protocol similar to Bayou’s exchanges 
log entries between pairs of replicas. Partial replica- 
tion is achieved by allowing replicas to selectively fetch 
invalidated objects. Imprecise invalidations that cover 
a range of objects let partial replicas maintain smaller 
logs. While PRACTI permits each replica to define its 
own “interest set”, the current design equates interest sets 
with file folders, and issues such as effective connectivity 
are left as policy decisions. Adding practical support for 
content-based filtering to PRACTI would require many 
of the techniques developed in Cimbiosys. 

More recently, the Perspective project at CMU has 
been exploring a replication paradigm most closely re- 
sembling that of Cimbiosys, but with a very different 
system design [20]. Each device in Perspective defines 
an attribute-based filter called a “view”. Only files in- 
cluded in a device’s view are stored on the device. Un- 
like Cimbiosys, each device is aware of all other devices 
and their views; hence, Perspective is more suitable for a 
small, fixed set of devices, such as those in a consumer’s 
home media system. Upon updating a file, a device sends 
a notification to all other available devices. Devices, in 
turn, fetch the updated files on demand. A disconnected 
device that misses update notifications is later brought 
up-to-date by synchronizing directly with other devices. 
A device can modify its view at any time, but it must 
inform the other devices and behave as a new replica 
during synchronization to obtain the files that match its 
new view. Cimbiosys, by contrast, allows content-based 
filters, bandwidth-efficient synchronization, incremental 
filter changes, incomplete knowledge of other replicas, 
and arbitrary synchronization partners. 

Table 2 summarizes the key design decisions in previ- 
ous partial replication systems as well as Cimbiosys. It 
focuses on the steps taken by the designers of these sys- 
tems to address the five key challenges of content-based 
partial replication presented in Section 2. 


10 Conclusion 


Cimbiosys is a new storage platform that provides filtered 
replication of content through peer-to-peer synchronization. 
Its design was motivated by the needs of 
loosely-organized communities and of individuals 
managing multiple devices. Cimbiosys allows each device 
to express its individual information needs as a content- 
based filter, permits devices to enter or leave the system 
without global coordination, accommodates dynamically 
changing content and filters, efficiently propagates up- 
dated items while avoiding duplicate delivery, exploits 
opportunistic encounters between devices with overlap- 
ping filters, and supports flexible synchronization topolo- 
gies (within certain constraints). 

Eventual filter consistency, whereby a device’s replica 
converges towards a state containing exactly those items 
that match its filter and nothing more, is achieved 
through a combination of novel technologies and prag- 
matic design decisions. Item-set knowledge, compactly 
represented as one or more version vectors and associ- 
ated items, records not only the versions that have been 
received by a device but also obsolete versions and ver- 
sions of items that no longer match its filter. Given a 
device’s knowledge and filter, the synchronization pro- 
tocol can readily determine exactly those versions of in- 
terest, thus meeting the challenge of partial synchroniza- 
tion. Under specific conditions, devices receive move- 
out notifications during synchronization and can discard 
out-of-filter versions without losing updates. When mod- 
ifying its filter, a device can adjust its knowledge so that 
its local item store is incrementally updated to match its 
new filter. 


Remarkably, knowledge converges towards a single 
version vector for all devices, with full or partially repli- 
cated contents. This eventual knowledge singularity 
property is achieved by ensuring that at least one device 
is authoritative for every version ever generated, trans- 
mitting star-knowledge for authoritative versions during 
synchronization, and compacting knowledge fragments. 
Our experimental evaluation, which was based on imple- 
mentations of our protocol, as well as model checking 
performed on a formal specification, demonstrate that 
eventual knowledge singularity is indeed realized if up- 
dates cease for a sufficiently long period. In a system 
with frequent updates and filter changes, devices may 
never actually reach knowledge singularity, but the tech- 
niques used to drive the system in that direction serve to 
keep knowledge to a manageable size. 


Using the CIM Sync protocol, eventual filter consis- 
tency and knowledge singularity will be attained in sys- 
tems where every device synchronizes occasionally with 
every other device. However, requiring full inter-device 
connectivity is unrealistic in many of the scenarios that 
we wish to support. By enforcing a hierarchically filtered 
tree topology, Cimbiosys maintains the desired proper- 
ties while providing some degree of flexibility in estab- 
lishing synchronization partnerships and still allowing ad 
hoc communication between peers. 
RPC Chains: Efficient Client-Server Communication
in Geodistributed Systems 


Yee Jiun Song


Abstract 


We propose the RPC chain, a simple but powerful com- 
munication primitive that allows an application to reduce 
the performance effects of wide-area links on enterprise 
and data center applications that span multiple sites. This 
primitive chains together multiple RPC invocations so 
that the computation can flow from one server to the next 
without involving the client every time. We demonstrate 
that RPC chains can significantly reduce end-to-end la- 
tency and network bandwidth in a storage application 
and a web application. 


1 Introduction 


Distributed enterprise applications, such as web appli- 
cations, are often built from more basic services, such 
as storage services, database management systems, au- 
thentication and configuration services, and services for 
interfacing with external components (e.g., credit card 
processing, banking, vendors, etc.). As systems become
larger, more complex, and more ubiquitous, there is a 
corresponding increase in the number, diversity, and geo- 
graphical dispersion of the remote services that they use. 
For instance, Hotmail and Live Messenger share an ad- 
dress book service and an authentication service; there 
are also services specialized for each application, say, for 
email storage or virus scanning. These services are het- 
erogeneous; they are often developed by different teams 
and are geo-distributed, running in different parts of the 
world. 

Geo-distribution provides many benefits: high avail- 
ability, disaster tolerance, locality, and ability to scale 
beyond one data center or site. However, the thin and 
slow links connecting different sites pose challenges, es- 
pecially in an enterprise setting, where applications have 
strict performance requirements. For instance, web ap- 
plications should ideally respond within one second [13]. 

The most common primitives for inter-service com-
munication are remote procedure calls (RPC’s) or RPC-
like mechanisms. RPC’s can impose undesirable com-
munication patterns and overheads when a client needs
to make multiple calls to servers. This is because RPC’s
impose communication of the form A→B→A (A calls B
which returns to A) even though this pattern may not be
optimal. For example, in Figure 1 left, a client A in site 1
uses RPC’s to consecutively call servers B, C, and D in
site 2. Server B, in turn, calls servers E and F in site 3.
The use of RPC’s forces the execution to return to A and
B multiple times, causing 10 crossings of inter-site links.

Figure 1: (Left) Standard RPCs. (Right) RPC chain.


We propose a simple but more general communication
primitive called a Chain of Remote Procedure Calls, or
simply RPC chain, which allows a client to call multiple
servers in succession (A→B1→B2→···→A), where the
request flows from server to server without involving the
client every time. The result is a much improved commu-
nication pattern, with fewer communication hops, lower
end-to-end latency, and often lower bandwidth consump-
tion. In Figure 1 right, we see how an RPC chain re-
duces the number of inter-site crossings to 4. The ex-
ample in this figure is representative of a web mail ap-
plication, where host A is a web server that retrieves a
message from an email server B, then retrieves an as-
sociated calendar entry from a calendar service C, and
finally retrieves relevant ads from an ad server D.
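
The crossing counts quoted for Figure 1 can be checked with a small calculation. The Python sketch below is illustrative only: the site assignment follows the text, but the exact hop order of each path is an assumption inferred from the description (B calls E and F before returning to A), since the figure itself is not reproduced here.

```python
# Site of each host, following the layout described for Figure 1.
SITE = {"A": 1, "B": 2, "C": 2, "D": 2, "E": 3, "F": 3}


def inter_site_crossings(path):
    """Count the hops in `path` whose endpoints lie in different sites."""
    return sum(1 for src, dst in zip(path, path[1:]) if SITE[src] != SITE[dst])


# Standard RPCs (assumed order): every call returns to its caller.
rpc_path = ["A", "B", "E", "B", "F", "B", "A", "C", "A", "D", "A"]

# RPC chain (assumed order): the request flows server to server and
# returns to A only once, at the end.
chain_path = ["A", "B", "E", "F", "C", "D", "A"]

print(inter_site_crossings(rpc_path))    # 10
print(inter_site_crossings(chain_path))  # 4
```

Hops between hosts in the same site (E to F, C to D) cost no crossing, which is where the chain saves most of its wide-area traffic.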


The key idea of RPC chains is to embed the chaining 
logic as part of the RPC call. This logic can be a generic 
function, constrained by some simple isolation mecha-
nisms. RPC chains have three important features:


• (1) Server modularity. What made RPC’s so success-
ful is the clean decoupling of server code, which al-
lows servers to be developed independently of each
other and the client. RPC chains preserve this at-
tribute, even allowing existing legacy RPC’s to be part
of a chain through simple wrappers.

• (2) Chain composability. If a server in the chain itself
wishes to call another server, this nested call can be
simply added to the chain in flux. In Figure 1, when
client A starts the chain, it intends to call only servers
B, C, and D. But server B wants to call servers E
and F, and so it adds them to the chain.

• (3) Chain dynamicity. The services that a host calls
need not be defined a priori; they can vary dynami-
cally during execution. In the left figure, the fact that
client A calls servers C and D need not be known
before A calls server B; it can depend on the result
returned by B. For example, an error condition may
cause a chain to end immediately instead of continu-
ing on to the next server.


We demonstrate RPC chains through a storage and a 
web application. For the storage application, we show 
how a storage server can be enabled to use RPC chains, 
and we give a simple use in which a client can copy data 
between servers without having to handle the data itself. 
This speeds up the copying and saves significant band- 
width. For the web application, we implement a simple 
web mail service that uses chains to reduce the overheads 
of an ad server. 

The paper is organized as follows. We explain the set- 
ting for RPC chains in Section 2. Section 3 covers the 
design of RPC chains and Section 4 covers applications. 
We evaluate RPC chains in Section 5, and we explain 
their limitations in Section 6. A discussion follows in 
Section 7. We discuss related work in Section 8 and we 
conclude the paper in Section 9. 


2 Setting 


We consider enterprise systems that span geographically- 
diverse sites, where each site is a local area network. 
Sites are connected to each other through thinner and 
slower wide area links. Wide-area links can be made 
faster by improving the underlying network, and lots of 
progress has been made here, but this progress is hin- 
dered by economic barriers (e.g., legacy infrastructure), 
technological obstacles (e.g., switching speeds), and fun- 
damental physical limitations (e.g., speed of light). Thus, 
the large discrepancy between the performance of local 
and wide-area links will continue. 

Unlike the Internet as a whole, enterprise systems op- 
erate in a trusted environment with a single adminis-
trative domain and experience little churn. These sys-
tems may contain a wide range of services, often de- 
veloped by many different teams, including general ser- 
vices for storage, database management, authentication, 
and directories, as well as application-specific services, 
such as email spam detection, address book manage- 
ment, and advertising. These services are often accessed 
using RPC’s, which we broadly define as a mechanism in 
which a client sends a request to a server and the server 
sends back a reply. This definition includes many types 
of client-server interactions, such as the interactions in 
CORBA, COM, REST, SOAP, etc. 

In enterprise environments, application developers are 
not malicious, though some level of isolation is desirable
so that a problem in one application or service does not 
affect others. 


3 Design 


We now explain the design of RPC chains, starting with 
the basic mechanism for chaining RPC’s in Section 3.1. 
The code that chains successive RPC’s is stored in a 
repository, explained in Section 3.2. In Section 3.3, we 
cover the state that is needed during the chain execution. 
We then discuss composition of chains in Section 3.4, 
legacy servers in Section 3.5, isolation in Section 3.6, 
debugging in Section 3.7, exceptions in Section 3.8, fail- 
ures in Section 3.9, and chain splitting in Section 3.10. 


3.1 Main mechanism 


Servers provide services in the form of service functions, 
which is the general term we use for remote procedures, 
remote methods, or any other processing units at servers. 
An RPC chain calls a sequence of service functions, pos- 
sibly at different servers. Service functions are connected 
together via chaining functions, which specify the next 
service function to execute in a chain (see Figure 2 top). 
Chaining functions are provided by the client and exe- 
cuted at the server. They can be arbitrary C# methods 
with the restriction that they be stand-alone code, that 
is, code which does not refer to non-local variables and 
functions, so that they can be compiled by themselves. 
We chose this general form of chaining for two rea- 
sons. First, we want to allow the chain to unfold dynami- 
cally, so that the choice of next hop depends on what hap- 
pens earlier in the chain. For example, an error at a ser- 
vice function could shorten a chain. Second, we wanted 
to support server modularity, so that services and client 
applications can be developed independently. Thus, a 
server may not produce output that is immediately ready 
for another server, in the way intended by the client’s 
application. One may need to convert formats, reorder 
parameters, combine them, or even combine the outputs 
from several servers in the chain. For example, an NFS 

// Service function
object sf(object parmlist)
// parmlist: parameter list

// Chaining function
nexthop cf(object state, object result)
// state: from client or earlier parts of chain
// result: from last preceding service function
// returns next chain hop:
//   (server, sf_name, parmlist,
//    cf_name, state)

chain_id start_chain(machine_t server,
    string sf_name, object parmlist,
    string cf_name, object state)


Figure 2: (Top) Signature of a service function (sf) and 
chaining function (cf). (Bottom) Signature of function 
that launches an RPC chain. 
Figure 3: Execution of an RPC chain (see explanatory
text in Section 3.1). RPCC stands for RPC chain. 


server does not output data in the format expected by a 
SQL server: one needs glue that will convert the output, 
choose the tables, and add the appropriate SQL wrapper, 
according to application needs. Chaining functions pro- 
vide this glue. We initially considered a simpler alterna- 
tive to chaining functions, in which a client just provides 
a static list of servers to call, but this design does not ad-
dress the issues above. We also note that it is easy to
translate a static server list into the appropriate chaining 
functions (one could even write a programmer tool that 
automatically does that), so our design includes static 
lists as a special case. 

Figure 3 shows how an RPC chain executes. (1) A
client calls our RPCC (RPC chain) library, specifying a
server, a reference to a service function sf1 at that server,
its parameters, and a chaining function cf1. (2) This in-
formation is then sent to the chosen server. (3) The server
executes service function sf1, which (4) returns a result.
(5) This result is passed to the chaining function cf1,
which then (6) returns the next server, service function,
and chaining function, and (7) the chain continues.

For example, suppose client A wants to call service
functions sfB, sfC, sfD at servers B, C, and D, in this or-
der. To do so, the client specifies a reference to sfB and a
chaining function cf1. cf1 causes a call to sfC at server C
with a chaining function cf2, which in turn causes a call
to sfD at server D with a chaining function cf3, which
causes the final result to be returned to the client A.
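
The execution steps above can be mimicked in a single process. The following Python sketch is a local simulation under stated assumptions, not the RPCC library (which is written in C#): the `NextHop` tuple mirrors the fields of Figure 2, while `SERVERS`, `CHAINING`, `run_chain_locally`, and the toy service functions are hypothetical names introduced for illustration. A null server is taken to end the chain, as described in Section 3.4.

```python
from typing import Any, Callable, Dict, NamedTuple, Optional


class NextHop(NamedTuple):
    server: Optional[str]  # None signals the end of the chain
    sf_name: str
    parmlist: Any
    cf_name: str
    state: Any


# Hypothetical servers, each exposing one toy service function.
SERVERS: Dict[str, Dict[str, Callable[[Any], Any]]] = {
    "B": {"sf_b": lambda p: p + 1},
    "C": {"sf_c": lambda p: p * 2},
    "D": {"sf_d": lambda p: p - 3},
}


def cf1(state, result):
    # Client logic run at B: forward B's result to sf_c at server C.
    return NextHop("C", "sf_c", result, "cf2", state + [result])


def cf2(state, result):
    # Client logic run at C: forward C's result to sf_d at server D.
    return NextHop("D", "sf_d", result, "cf3", state + [result])


def cf3(state, result):
    # Client logic run at D: end the chain; the state goes back to A.
    return NextHop(None, "", None, "", state + [result])


CHAINING = {"cf1": cf1, "cf2": cf2, "cf3": cf3}


def run_chain_locally(server, sf_name, parmlist, cf_name, state):
    """Run service function, then chaining function, until the chain ends."""
    hops = []
    while server is not None:
        hops.append(server)
        result = SERVERS[server][sf_name](parmlist)
        server, sf_name, parmlist, cf_name, state = CHAINING[cf_name](state, result)
    return state, hops


final_state, hops = run_chain_locally("B", "sf_b", 4, "cf1", [])
print(hops)         # ['B', 'C', 'D']
print(final_state)  # [5, 10, 7]
```

Because each chaining function decides the next hop from the previous result, shortening the chain on an error only requires returning a `NextHop` whose server is `None`.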


3.2 Chaining function repository 


Chaining functions are provided by clients but executed 
at servers. To save bandwidth, in our implementation the 
client does not send the actual code to the server. Rather, 
the client uploads the code to a repository, and sends a 
reference to the server; the server downloads the code 
from the repository and caches it for subsequent use. The 
repository stores chaining functions in source code for- 
mat, and servers compile the code at runtime using the 
reflection capabilities of .NET/C# (Java has similar ca- 
pabilities). 

We store source code because it introduces fewer de- 
pendencies, is more robust (binary formats change more 
frequently), and simplifies debugging. Because the cost 
of runtime compilation can be significant (~50 ms, see 
Section 5.2.1), servers cache the compiled code, not the 
source code, to avoid repeated compilations. 

When the chaining function is very small, it could be 
transmitted by the client with the RPC chain, so that the 
server does not have to contact the repository. Our im- 
plementation presently does not support this option. 
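
As a rough illustration of the download, compile, and cache sequence, consider the Python sketch below. It is only an analogy for the .NET mechanism: the in-memory `REPOSITORY` dictionary stands in for the remote repository, and `ChainingFunctionCache` and its `compilations` counter are hypothetical names.

```python
# Stand-in for the remote repository: reference name -> source code.
REPOSITORY = {
    "cf_double": (
        "def cf(state, result):\n"
        "    return (None, state + [result * 2])\n"
    ),
}


class ChainingFunctionCache:
    """Server-side cache holding compiled chaining functions, not source."""

    def __init__(self, repository):
        self._repository = repository
        self._compiled = {}
        self.compilations = 0  # counts how often the compile cost is paid

    def get(self, cf_name):
        if cf_name not in self._compiled:
            source = self._repository[cf_name]  # simulated download
            namespace = {}
            exec(compile(source, cf_name, "exec"), namespace)
            self._compiled[cf_name] = namespace["cf"]
            self.compilations += 1
        return self._compiled[cf_name]


cache = ChainingFunctionCache(REPOSITORY)
first = cache.get("cf_double")
second = cache.get("cf_double")   # served from the cache, no recompilation
print(cache.compilations)          # 1
print(first([1], 3))               # (None, [1, 6])
```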


3.3 Parameters and state


A chaining function is client logic that may depend on 
run-time variables, tables, or other state from the client 
or earlier parts of the chain. This state needs to be passed 
along the chain, and ideally it should be small, otherwise 
its transmission cost can outweigh the benefits of an RPC 
chain (see Section 5.2.2). We represent the state as a set 
of name-value pairs, which is passed as a parameter to 
the chaining function (see Figure 2). 

The output of each service function is also passed as 
a parameter to the subsequent chaining function. For ex- 
ample, in our storage copy application (Section 4.1), the 
first service function reads a file, and the chaining func- 
tion uses the result as input to the next service function, 
which writes to a file on a different server. In our email 
application, a service function reads an email message, 
and the chaining function adds the message to the state of 
the next chaining function, so that the message is passed 
along the chain back to the chain originator (a mail web 
server). 


3.4 Nesting and composition 


RPC chains can be nested: a service function in a chain 
may itself start a sub-chain. For example, the main chain 
could call a storage service, which then needs to call a 
Figure 4: Composition of nested chains. (Left) The main
chain 1 and a sub-chain 2. (Right) Result and manner of
composing chains. (I) B starts a sub-chain, causing the
RPCC library to push the B→C chaining function and
its state parameter into a stack. (II) Chaining function
at F returns an indication that the chain ended and the
result that B is supposed to produce. This causes the
RPCC library to pop from the stack, obtaining the B→C
chaining function and its state parameter. It then calls
this chaining function with the result and state. The chain
now continues at C.


replica. We implement nesting so that a nested chain can 
be adjoined to an existing chain, as shown in Figure 4. 
Note the difference between starting a chain going from 
B to E, and moving to the next host in a chain going from 
C to D: the former occurs when the service function at 
B starts a new chain, while the latter occurs when the 
chaining function at C calls the next node in the chain. 
This distinction is important because the service function 
at B represents a native procedure at the service, while a 
chaining function at C represents logic coming from A. 
At E, the chaining function that calls F represents logic 
coming from B. 


To compose a chain with its sub-chain, the chaining 
function of the parent chain needs to be invoked when 
a sub-chain ends (to continue the parent chain). Ac- 
cordingly, when a host starts a sub-chain, the RPCC li- 
brary saves the chaining function and its state param- 
eter, and passes them along the sub-chain. The sub- 
chain ends when its chaining function returns null in 
nexthop.server, and a result in nexthop.state (this is
the result that the host originating the sub-chain must 
produce for the parent chain). When that happens, the 
RPCC library calls the saved chaining function with the 
saved state and nexthop.state. Note that a chain and a 
sub-chain need not be aware of each other for composi- 
tion. 


To allow multiple levels of nesting, we use a chain 
stack that stores the saved chaining function and its state 
for each level of composition. The stack is popped as 
each sub-chain ends. 
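
The push and pop behavior of the chain stack can be sketched as follows. This Python fragment is a simplified model, not our C# implementation: a next hop is reduced to a `(server, state)` pair, and `start_subchain`, `finish_hop`, `cf_b_to_c`, and `cf_f_end` are hypothetical names chosen to match the hosts in Figure 4.

```python
def cf_b_to_c(state, result):
    # Parent-chain logic from A: once B has its result, continue at C.
    return ("C", {**state, "from_b": result})


def cf_f_end(state, result):
    # Sub-chain logic from B: F is the last hop, so return a null server
    # together with the result that B is supposed to produce.
    return (None, result)


def start_subchain(stack, parent_cf, parent_state):
    """Save the parent's chaining function and state when a sub-chain starts."""
    stack.append((parent_cf, parent_state))


def finish_hop(stack, nexthop):
    """Resume saved parent chains for as long as the current chain has ended."""
    server, value = nexthop
    while server is None and stack:
        parent_cf, parent_state = stack.pop()
        server, value = parent_cf(parent_state, value)
    return server, value


stack = []
start_subchain(stack, cf_b_to_c, {"request": 1})   # (I) B starts the sub-chain
nexthop = cf_f_end({}, "replicated")               # (II) chaining function at F
print(finish_hop(stack, nexthop))
# ('C', {'request': 1, 'from_b': 'replicated'})
```

The loop in `finish_hop` covers multiple nesting levels: if the resumed parent chain also ends, the next saved entry is popped in turn.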
3.5 Handling legacy RPC services


RPC chains support legacy services that have standard 
RPC interfaces. For that, we use a simple wrapper mod- 
ule, installed at the legacy RPC server, which includes 
the RPCC library and exposes the legacy remote proce- 
dures as service functions. 

Each service function passes requests and responses 
to and from the corresponding legacy remote procedure. 
Because the service function calls the legacy remote pro- 
cedure locally through the RPC’s standard network inter- 
face (e.g., TCP), the legacy server will see all requests as 
coming from the local machine, and this can affect net- 
work address-based server access control policies. (This 
is not a problem if access control is based on internal 
RPC authenticators, such as signatures or tokens, which 
can be passed on by the wrapper.) 

One solution is to re-implement the access con- 
trol mechanism at the wrapper, but this is application- 
specific. A better solution is for the wrapper to fake the 
network address of its requests and capture the remote 
procedure’s output before it is placed on the network. 
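
A minimal sketch of such a wrapper, in Python for illustration, is shown below. The `LegacyWrapper` class and the `fake_legacy_rpc` stub are hypothetical; in a real deployment the stub would be a local call through the legacy server's standard RPC interface, and the access-control concerns discussed above are not modeled.

```python
class LegacyWrapper:
    """Exposes legacy remote procedures as service functions."""

    def __init__(self, legacy_rpc):
        # legacy_rpc(proc_name, args) stands in for a local call made
        # through the legacy server's standard RPC interface.
        self._legacy_rpc = legacy_rpc

    def service_function(self, proc_name):
        def sf(parmlist):
            # Pass the request to the legacy procedure and relay the reply.
            return self._legacy_rpc(proc_name, parmlist)
        return sf


def fake_legacy_rpc(proc_name, args):
    # Toy legacy server with a single remote procedure.
    procedures = {"read": lambda a: f"contents of {a['path']}"}
    return procedures[proc_name](args)


wrapper = LegacyWrapper(fake_legacy_rpc)
sf_read = wrapper.service_function("read")
print(sf_read({"path": "/tmp/x"}))  # contents of /tmp/x
```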


3.6 Isolation 


Chaining functions are pieces of client code running at 
servers. Even though clients are trustworthy in the en- 
vironment we consider, they are still prone to buffer 
overruns, crashes, and other problems. Thus, chain- 
ing functions are sandboxed to provide isolation, so that 
client code cannot crash or otherwise adversely affect the 
server on which it runs. 


We need two types of isolation: (1) restricting access 
to sensitive functions, such as file and network I/O and 
privileged operating system calls, and (2) restricting ex- 
cessive consumption of resources (CPU and memory). 

We achieve (1) through direct support by .NET/C# of 
access restrictions to file I/O, system and environment 
variables, registry, clipboard, sockets, and other sensi- 
tive functions (Java has similar capabilities). This is ac- 
complished by placing descriptive annotations, called at- 
tributes, in the source code of chaining functions when 
they are compiled at run-time. 

We achieve (2) by monitoring CPU and memory uti- 
lization and checking that they are within preset val- 
ues. The appropriate values are a matter of policy at 
the server, but for the short-lived type of executions that 
we target with RPC chains, chaining functions should 
consume at most a few CPU seconds and hundreds of 
megabytes of memory, even in the most extreme cases. 

If a chaining function violates restrictions on access or 
resource consumption, an RPC chain exception is thrown 
according to the mechanism in Section 3.8. 
Another way to isolate chaining functions is to use a
chaining proxy (Section 7.3). 
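
The resource check in (2) can be approximated as in the Python sketch below. It is a simplification under stated assumptions: usage is measured after the chaining function returns rather than monitored while it runs, the default limits are example values in the range mentioned above, and `run_with_limits` and `RPCChainException` are hypothetical names.

```python
import time
import tracemalloc


class RPCChainException(Exception):
    """Raised when a chaining function violates a server policy."""


def run_with_limits(cf, state, result,
                    max_cpu_seconds=2.0, max_bytes=200 * 1024 * 1024):
    """Run a chaining function, then check its CPU time and peak memory."""
    tracemalloc.start()
    cpu_start = time.process_time()
    try:
        nexthop = cf(state, result)
        cpu_used = time.process_time() - cpu_start
        _, peak_bytes = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    if cpu_used > max_cpu_seconds:
        raise RPCChainException("CPU limit exceeded")
    if peak_bytes > max_bytes:
        raise RPCChainException("memory limit exceeded")
    return nexthop


# A well-behaved chaining function passes the check.
print(run_with_limits(lambda state, result: ("C", result), {}, 1))
```

A production sandbox would have to interrupt a runaway function while it is still executing; this sketch only shows where the policy values are applied.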


3.7 Debugging and profiling 


A very useful debugging tool for traditional applications 
is “printf”, which allows an application to display mes-
sages on the console. We provide an analogous facil- 
ity for RPC chain applications: a virtual console, where 
nodes in the chain can log debugging information. The 
contents of the virtual console are sent with the chain, 
and eventually reach the client, which can then dump the 
contents to a real console or file. The virtual console can 
also be used to gather profiling information for each step 
in the chain and be aggregated at the client. 

Even with “printf”, debugging RPC chains can be 
hard, because it involves distributed execution over mul- 
tiple machines. We can reduce this problem to the sim- 
pler problem of debugging RPC-based code by running 
RPC chains in a special interactive mode. The key obser- 
vation is that chaining functions are portable code that 
can be executed at any machine. In interactive mode, 
chaining functions always execute at the client instead of 
the servers. To accomplish this, after each service func- 
tion returns, the RPCC library sends its result back to the 
client, which then applies the chaining function to con- 
tinue the chain from there. A chain executed in interac- 
tive mode looks like a series of RPC calls. By running 
the client in an interactive debugger, the developer can 
control the execution of the chain and inspect the outputs 
of service and chaining functions at each step. 
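
The virtual console can be pictured as a log that travels with the chain. The Python sketch below runs the hops locally and is illustrative only; `run_chain_with_console` and the toy service functions are hypothetical.

```python
def run_chain_with_console(hops, request):
    """Run each hop in order, appending a debug line to the virtual console."""
    console = []  # travels with the chain and is returned to the client
    value = request
    for host, service_function in hops:
        value = service_function(value)
        console.append(f"{host}: result={value!r}")
    return value, console


hops = [("B", lambda x: x + 1), ("C", lambda x: x * 10)]
result, console = run_chain_with_console(hops, 1)
print(result)  # 20
for line in console:
    print(line)
# B: result=2
# C: result=20
```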


3.8 Exceptions 


An RPC chain may encounter exceptional conditions 
while it is executing: (1) the next server in the chain 
can be down, (2) the chaining function repository can be 
down, or (3) the state passed to the chaining function can 
be missing vital information due to a bug. All of these 
will result in an exception, either at the RPCC library in 
cases (1) and (2), or at a chaining function in case (3). 
(Service functions do not throw exceptions; they simply 
return an error to the caller.) 

Who should handle such exceptions? One possibility 
is to handle them locally, by having the client send ex- 
ception handling code as part of the chain. Doing this re- 
quires sending all the state that the handling code needs, 
which complicates the application design. Instead, we 
choose a less efficient but simpler alternative (since ex- 
ceptions are the rare case). We simply propagate excep- 
tions back to the client that started the chain. The client 
receives the exception name, its parameters, and the path 
of hosts that the chain has traversed thus far. (If the client 
crashes, the exception becomes moot and is ignored.) 


In the case of nested chains, the exception propagates 
first to the host that started the current sub-chain. If that 
host does not catch the exception, it continues propagat- 
ing to the host that started the parent chain, until it gets 
to the client. For example, in Figure 4 right, if E throws 
an exception (say, because it could not contact F), the 
exception goes to B, the node that created the sub-chain. 
This is a natural choice because B understands the logic 
of the sub-chain that it created, and so it may know how 
to recover from the exception. If B does not catch the 
exception, it is propagated to A. 
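
The propagation order for nested chains can be stated compactly. In the Python sketch below, `propagate` and the `handlers` table are hypothetical; the function returns the host that ends up handling the exception, trying the originator of the innermost sub-chain first and falling back to the client.

```python
def propagate(exception_name, originators, handlers):
    """Return the host that handles the exception.

    originators: hosts that started the nested chains, outermost (the
                 client) first, e.g. ["A", "B"] for Figure 4.
    handlers:    host -> set of exception names that host catches.
    """
    for host in reversed(originators):
        if exception_name in handlers.get(host, set()):
            return host
    # Nobody caught it: the exception reaches the client.
    return originators[0]


# E cannot contact F; B created the sub-chain and knows how to recover.
print(propagate("ServerDown", ["A", "B"], {"B": {"ServerDown"}}))  # B
# B does not catch the exception, so it is propagated to A.
print(propagate("ServerDown", ["A", "B"], {}))                     # A
```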


3.9 Broken chains 


The crash of a host while it executes an RPC chain results 
in a broken chain. In this section, we describe the broken 
chain detection and recovery mechanisms. 


Detection. We detect a broken chain using a simple 
end-to-end timeout mechanism at the client called chain 
heartbeats: a chain periodically sends an alive message 
to the client that created it, say every 3 seconds, and the 
client uses a conservative timeout of 6 seconds. If there 
are sub-chains, only the top-level creator gets the heart- 
beats. Heartbeats carry a unique chain identifier, a pair 
consisting of the client name and a timestamp, so that the 
client knows to which chain it refers. 


We achieve the periodic sending through a time-to- 
heartbeat timer, which is sent with the chain, and it is 
decremented by each node according to its processing 
time, until it reaches 0, the time to send a heartbeat. Syn- 
chronized clocks are not needed to decrement the timer; 
we only need clocks that run at approximately the same 
speed as real time. Since we do not know link delays, 
we assume a conservative value of 200 ms and decre- 
ment the time-to-heartbeat timer by this amount for ev- 
ery network hop. This assumption may be violated
when there is congestion or dropped packets, resulting in
a premature timeout (false positive). However, the im- 
pact of false positives is small because of our recovery 
mechanism, explained next. 
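
The timer arithmetic can be sketched as follows, using the 3-second period and the 200 ms per-hop allowance given above. The Python code is illustrative: `advance_timer` is a hypothetical name, times are kept in integer milliseconds, and resetting the timer to a full period after each heartbeat is an assumption.

```python
HEARTBEAT_PERIOD_MS = 3000  # a heartbeat is due roughly every 3 seconds
LINK_DELAY_MS = 200         # conservative allowance per network hop


def advance_timer(time_to_heartbeat_ms, processing_ms):
    """Charge one node's processing time plus one network hop.

    Returns (new_timer_ms, send_heartbeat).
    """
    remaining = time_to_heartbeat_ms - processing_ms - LINK_DELAY_MS
    if remaining <= 0:
        # Time to send a heartbeat to the client; restart the timer.
        return HEARTBEAT_PERIOD_MS, True
    return remaining, False


timer = HEARTBEAT_PERIOD_MS
for processing_ms in [1000, 1000, 1000, 500]:
    timer, send = advance_timer(timer, processing_ms)
    print(timer, send)
# 1800 False
# 600 False
# 3000 True
# 2300 False
```

Only elapsed durations are subtracted, which is why the nodes need clocks that tick at a similar rate but not synchronized clocks.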

Recovery. To recover from a broken chain, the client 
simply retransmits the request. Like standard remote 
procedures, we make chains idempotent by including 
a chain-id with each chain, and briefly caching the re- 
sults of service functions and chaining functions at each 
server. If a server sees the same chain-id, it uses the 
cached results for the service and chaining functions. 
The chain can continue in this fashion up to the host 
where the chain previously broke. At that host, if the 
“next” host is still down, an exception is thrown. Alter- 
natively, a fail-over mechanism that calls a backup server 
can be implemented by using logical server names which 
are mapped to a backup when the primary fails. This 
is similar to the mechanisms used to fail over standard
RPC’s. 
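
The caching that makes a retransmitted chain idempotent can be sketched as below. This Python fragment is illustrative: `ChainServer` is a hypothetical name, only the service function result is cached, and the expiry of cache entries (results are cached only briefly) is omitted.

```python
class ChainServer:
    """Caches results by chain-id so that a retransmitted chain is idempotent."""

    def __init__(self, service_function):
        self._service_function = service_function
        self._cache = {}
        self.executions = 0  # how many times the service function really ran

    def handle(self, chain_id, parmlist):
        if chain_id not in self._cache:
            self.executions += 1
            self._cache[chain_id] = self._service_function(parmlist)
        return self._cache[chain_id]


server = ChainServer(lambda amount: f"debited {amount}")
chain_id = ("clientA", 1234567890)          # (client name, timestamp)
print(server.handle(chain_id, 10))          # debited 10
print(server.handle(chain_id, 10))          # debited 10 (from the cache)
print(server.executions)                    # 1
```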

Upon a second timeout, a client executes the RPC 
chain in interactive mode (as in Section 3.7), to deter- 
mine exactly at which node the chain stopped, and re- 
turns an error to the application. 


3.10 Splitting chains 


For performance reasons, it may be desirable to split 
a chain to allow parallel execution. The decision to 
split a chain should be made with consideration of the 
added complexity, as concurrent computations are al- 
ways harder to understand, design, debug, and maintain 
compared to sequential computations. Although our ap- 
plications do not use splitting chains, we now explain 
how such chains can be implemented. 

Split. We modify chaining functions so that they can 
return more than one nexthop parameter. The RPCC li- 
brary calls each nexthop concurrently, resulting in
several split-chains. Each chain has an id comprised of
the id of the parent plus a counter. For example, if there 
is a 3-way split of chain 74, the split-chains will have 
ids 74.1, 74.2, and 74.3. Each of these split chains can in 
turn be split again, and result in split-chains with increas- 
ingly long ids. For example, if split-chain 74.1 splits 
into two, the resultant split-chains will have ids 74.1.1 
and 74.1.2. We note for future reference that each split- 
chain knows how many siblings it has (this information 
is passed on to the split-chains when the chain splits). 
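
The id scheme is simple enough to state in a few lines. In the Python sketch below, `split_chain_ids` is a hypothetical helper; representing ids as dotted strings is an assumption made for illustration.

```python
def split_chain_ids(parent_id, ways):
    """Ids of the split-chains created by an n-way split of `parent_id`."""
    return [f"{parent_id}.{counter}" for counter in range(1, ways + 1)]


print(split_chain_ids("74", 3))    # ['74.1', '74.2', '74.3']
print(split_chain_ids("74.1", 2))  # ['74.1.1', '74.1.2']
```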

Broken split chains. Recall that we use an end-to- 
end mechanism to handle broken chains (Section 3.9) via 
a chain heartbeat. When a chain splits, we also split the 
heartbeats: each split-chain sends its own heartbeat (with 
the split-chain id) and the client will be content only if it
periodically sees the heartbeat from all the split-chains. 
The heartbeat messages indicate the number of sibling 
split-chains, so that the client knows how many to expect. 
If a split-chain is missing, the client starts the chain again 
(even if other split-chains are still running, this does not 
cause a problem because of idempotency). 
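
The client-side check can be sketched as follows. The Python code is illustrative only: `missing_split_chains` is a hypothetical helper, ids are dotted strings, and the timeout window within which the heartbeats must arrive is not modeled.

```python
def missing_split_chains(parent_id, sibling_count, seen_ids):
    """Split-chains whose heartbeat has not been seen in the current window."""
    expected = {f"{parent_id}.{i}" for i in range(1, sibling_count + 1)}
    return sorted(expected - set(seen_ids))


# Heartbeats from 74.1 and 74.3 say there are 3 siblings; 74.2 is silent.
missing = missing_split_chains("74", 3, ["74.1", "74.3"])
print(missing)  # ['74.2']
if missing:
    print("restart chain 74")  # safe to repeat because chains are idempotent
```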

Merge. To merge split-chains, a merge host collects 
the results of each split-chain and invokes a merge func- 
tion to continue the chain. The merge host and function 
are chosen when the chain splits (they are returned by the 
chaining function causing the split). The merge host can 
be any host; a good choice is the next host in the chain. 
The merge host awaits outcomes from all split-chains be- 
fore calling the merge function, which takes the vector of 
results and returns nexthop, specifying the next service 
function and chaining function to call. 
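
A merge host along these lines might look like the Python sketch below. Our applications do not use splitting, so this is purely illustrative: `MergeHost` is a hypothetical class, results are ordered by split-chain id, and the merge function returns a simplified `(server, value)` next hop.

```python
class MergeHost:
    """Collects split-chain results and calls the merge function once."""

    def __init__(self, expected_count, merge_function):
        self._expected_count = expected_count
        self._merge_function = merge_function
        self._results = {}

    def deliver(self, split_id, result):
        """Record one outcome; return the next hop once all have arrived."""
        self._results[split_id] = result
        if len(self._results) < self._expected_count:
            return None  # still waiting for other split-chains
        ordered = [self._results[key] for key in sorted(self._results)]
        return self._merge_function(ordered)


host = MergeHost(2, lambda results: ("D", sum(results)))
print(host.deliver("74.1", 5))  # None
print(host.deliver("74.2", 7))  # ('D', 12)
```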

After split-chains complete (i.e., reach the merge 
host), the parent chain will continue and resume its heart- 
beats. However, split-chains do not necessarily complete 
at the same time, so there may be a period from when the
first split-chain completes until the parent chain resumes. 
During this period the merge host sends heartbeats on be- 
half of the completed split-chains, so that the client does 
not time out. 

Crash garbage. When there are crashes in the sys- 
tem, the merge host may end up with the outcome of 
stale split-chains. This garbage can be discarded after 
a timeout: as we mentioned, RPC chains are intended 
for short-lived computations, so we propose a timeout of 
a minute. Note that if a slow system causes a running 
chain to be garbage collected, the client will recover af- 
ter it times out. 


4 Applications 


To demonstrate RPC chains, we apply and evaluate them 
in two important enterprise applications: a storage appli- 
cation (Section 4.1) and a web application (Section 4.2). 


4.1 Storage applications 


Storage services generally provide two basic functions, 
read and write, based on keys, file names, object id’s, 
or other identifiers. While this generic interface is suit- 
able for many applications, its low-level nature some- 
times forces bad data access patterns on applications. For 
instance, if a client wants to copy a large object from one 
storage server to another, the client must read the object 
from one server and write it to the other, causing all the 
data to go through the client. If the client is separated 
from the storage servers by a high latency or low band- 
width connection, this copying could be very slow. 

One solution is to modify the storage service on a case- 
by-case basis for different operations and different set- 
tings. For example, the Amazon S3 storage service re- 
cently added a new copy operation to its interface [2], 
so that an end user can efficiently copy her data be- 
tween data centers in the US and Europe, without hav- 
ing to transfer data through her machine. Although such 
application-specific interfaces can be beneficial, they are 
specific to particular operations and do not mitigate ad- 
verse communication patterns in other settings. 

RPC chains provide a more general solution: they not 
only enable the direct copying of data from one server 
to another (through a simple chain that reads and then 
writes), but also enable broader uses. To demonstrate this 
idea, we layered RPC chains over a legacy NFS v3 storage
server, as explained in Section 3.5. (We could have
used other types of storage, such as an object store.) We 
then implemented a simple chain to copy data without 
passing through the client. 
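
The copy chain can be pictured with a small in-process simulation. This is only a sketch under stated assumptions: the tuple layout of a hop, the names `run_chain`, `after_read`, and `after_write`, and the in-memory `STORES` dictionary are all hypothetical stand-ins, not the RPCC library interface or the NFS server.

```python
# Hypothetical in-memory stand-ins for two storage servers.
STORES = {"storage1": {"report.txt": b"quarterly numbers"}, "storage2": {}}


def read(server, name):
    """Service function: return the named object from `server`."""
    return STORES[server][name]


def write(server, name, data):
    """Service function: store the object at `server`, return its size."""
    STORES[server][name] = data
    return len(data)


def after_read(result, state):
    # Chaining function, executed at the source server: forward the data
    # straight to the destination instead of returning it to the client.
    next_hop = (state["dst"], write, (state["name"], result), after_write)
    return next_hop, state


def after_write(result, state):
    # Last chaining function: only the byte count goes back to the client.
    return None, dict(state, bytes_written=result)


def run_chain(hop, state):
    """Execute hops until a chaining function returns no next hop."""
    path = []
    while hop is not None:
        server, service, args, chaining_fn = hop
        result = service(server, *args)
        path.append(server)
        hop, state = chaining_fn(result, state)
    return state, path


def copy(src, dst, name):
    """Start a read-then-write chain from `src` to `dst`."""
    state = {"dst": dst, "name": name}
    return run_chain((src, read, (name,), after_read), state)
```

Calling `copy("storage1", "storage2", "report.txt")` visits the two servers in order, and the object data never appears in what is returned to the caller; only the small chaining state does.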

We also show a more sophisticated application of
chains by implementing a primary-backup replication of
the storage server: when the primary receives a write
request, it creates a chain to apply the request on a backup
server. Because replication is done through chains, it can
be composed with other chains. This is illustrated in
Figure 5(b), which shows a setup with two storage servers,
the second of which is replicated, and a user who wants
to copy data from the first to the second server. Two
chains are created as a result of this request: a chain that
the client launches for copying, and another that the
second storage server launches for replication. The RPCC
library allows these two chains to be composed together,
as shown in Figure 5(c). We report on quantitative
benefits of our approach in Section 5.3.

Figure 5: (a) Copying data from storage server 1 to a replicated storage server 2 without RPC chains. The client reads
from storage 1 and writes to storage 2; when this happens, storage 2 writes to a backup server. (b) Using a chain to
copy data and a chain to replicate data (composition disabled). (c) Composing the chains. The chains are not aware of
each other but the RPCC library can combine them.


4.2 Web mail application 


Web applications are generally composed of multiple 
tiers or services: there are front-end web servers, au- 
thentication servers, application servers, and storage and 
database servers. Some of these tiers, namely the web 
servers and application servers, play the role of orches- 
trating other tiers, and they tend to keep very little user 
state of their own, other than soft session state. This is a 
propitious setting for RPC chains, because performance 
gains can be realized by optimizing the communication 
patterns of the various services. We demonstrate this 
point with a sample application. 

We consider a typical web mail application. There
are web servers that handle HTTP requests, authentication
servers and address-book servers that are shared
with other applications, email storage servers that store
the users’ mail, and ad servers that are responsible for
displaying relevant ads. These services can be located
in multiple data centers, for several reasons: (1) no single
data center can host them all; (2) a service may have
been developed in a particular location and so it is hosted
close by; (3) for performance reasons, it may be desirable
for some services to be located close to their users
(e.g., users created in Asia may have their mailbox stored
in Asia), though this is not always achievable (e.g., an
Asian user travels to the U.S. and his mailbox is still in
Asia); and (4) a service may need high availability or the
ability to withstand disasters.

Figure 6: A simplified web mail server that uses RPC
chains. The solid line shows the login sequence followed
by retrieval of email and ads. The dashed line shows
how a system based on standard RPC’s would differ. The
chain is not used for the web client, since it is outside the
system. It is used in the communication between mail,
storage, and ad servers.


We implemented a simple web mail service as shown 
in Figure 6, to study the benefits of RPC chains in such 
a setting. Our web mail system consists of a front- 
end server that authenticates users by verifying their lo- 
gins and passwords. Upon successful authentication, the 
front-end server returns a cookie to the client along with 
the name of an email server. The client then uses the 
cookie to communicate with the email server to send and 
receive email messages. Upon receiving a client request, 
the email server first verifies the cookie, then calls the 
back-end storage server to fetch the appropriate emails 
for the user. Finally, the mail server sends the message 
to an ad server so that relevant ads can be added to the 
messages before they are returned to the client. 


Note that the adding of ads to emails imposes a sig- 
nificant overhead on performance. This is of particular 
concern because one of the primary performance goals
of a webmail service is to minimize the response time 
observed by clients. In addition, emails and ads cannot 
be fetched in parallel, since relevant ads cannot be se- 
lected without knowing the contents of the emails. It is 
also difficult to pre-compute the relevant ads because the 
relevance of ads may change over time. 

Using RPC chains, we can mitigate some of the ad-related
overheads. Even though we can only fetch ads
after fetching the emails, we can eliminate one latency
hop from the communication path of the web mail application
by creating a chain that causes emails to be sent
directly from the storage server to the ad server, without
having to go through the email server (as shown in step
7 of Figure 6). Once the ad server has appended the appropriate
ads to the emails, the emails can be sent to the
email server, which then returns them to the client. In
Section 5.4, we evaluate the benefit of using RPC chains to
improve the communication pattern in this fashion.


5 Evaluation 


We now evaluate RPC chains. We start with some mi- 
crobenchmarks, in which we measure the overhead of 
chaining functions and we compare RPC chains versus 
standard RPC’s. We then evaluate the storage and web 
applications to demonstrate the performance improvements
provided by RPC chains. The general question we
address is when RPC chains are advantageous and what
their exact benefits are.


5.1 Setup 


Our experimental setup for the storage and multi-tier web
applications consists of ten machines at four geodistributed
sites in a corporate network that spans the globe:
(1) Mountain View, California, USA,
(2) Redmond, Washington, USA, (3) Cambridge, United
Kingdom, and (4) Beijing, China. The measured latency
and throughput of the links between these sites are shown
in Figure 7.


5.2 Microbenchmarks


5.2.1 Overhead of chaining functions


In our first experiment, we evaluate the overhead im- 
posed by chaining functions (pieces of client code) at 
servers. We considered chaining functions of three sizes, 
621 bytes, 5 KB, and 50 KB, corresponding to small, 
medium, and large functions. 

We first measured the time it takes to compile a function
at run-time. The results are shown in the first two
columns of Figure 8, averaged over 10 runs (± refers to
standard error). We used a 3 GHz Intel Core 2 Duo processor
running Windows Vista Enterprise SP1. The functions
were written in C# and compiled using Microsoft
Visual Studio 2008.

Figure 7: (a) Ping round-trip times and (b) bandwidth of
TCP connections between pairs of sites.


Source size (KB)   Compile time (ms)   Compiled size (KB)
0.6                45.7 ± 0.3          0.4
5                  47.1 ± 0.4          4.6
50                 76.0 ± 0.3          15.9

Figure 8: Overhead for compiling chaining functions and
storing compiled code.

We also did a linear regression with a larger set of 
points (17 sizes, with 10 runs each) and found that the 
cost of compilation is 44.8 ms plus 1 ms for each 5000 
bytes of source code. We see that there is a large initial 
compilation cost of tens of milliseconds, which we do 
not want to pay every time we call the server in a chain. 

We measured the size of the compiled code, shown 
in the third column of Figure 8. We see that it is very 
small (we initially thought it would be large, but this is 
not the case). This allows the server to cache even tens 
of thousands of chaining functions in less than 50 MB, 
which justifies our choice of doing so. 
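
The regression and the caching argument can be restated as two small formulas. The sketch below is only a worked restatement of the numbers quoted in the text; the function names are hypothetical, and the cache estimate assumes every cached function has the same compiled size.

```python
def compile_time_ms(source_bytes):
    """Fitted compilation cost: 44.8 ms plus 1 ms per 5000 bytes of source."""
    return 44.8 + source_bytes / 5000.0


def cache_footprint_mb(num_functions, compiled_kb_each):
    """Memory needed to cache compiled functions of a uniform size."""
    return num_functions * compiled_kb_each / 1024.0


# The fixed term dominates: a 5000-byte function costs 45.8 ms under the fit,
# which is why compiled functions are cached rather than recompiled per call.
small_function_cost = compile_time_ms(5000)

# Assumed workload: 50,000 cached functions of the smallest measured
# compiled size (0.4 KB each).
footprint = cache_footprint_mb(50_000, 0.4)
```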


5.2.2 RPC chain versus standard RPC 


In our next experiment, we compare the latency of an
RPC chain versus standard RPC. We used the smallest
non-trivial chain, which goes through two servers (a
chain that goes through only one server is the same as
an RPC), and compare it against a pair of consecutive
RPC’s going to the two servers, as shown in Figure 9. To
isolate concerns, the service executed at each server is a
no-op.

Figure 9: Executions used in the experiment of Section 5.2.2.

The figure makes it clear that the RPC chain incurs one 
fewer hop than the pair of RPC calls. What is not shown 
is that the RPC chain has potentially two overheads that 
the pair of RPC calls do not: (1) even if the client needs 
the response from server 1 but server 2 does not, the data
is still relayed through server 2, and (2) the client needs 
to send state for the chaining function to execute at server 
1. The first overhead can be avoided through a simple 
extension to RPC chains to allow each server in the chain 
to send some data to the client (Section 7.1). 

We now consider the second overhead, and examine 
the question of how much state the client can send while 
still allowing the RPC chain to be faster than the pair 
of RPC calls. We assume that the chaining function is 
already cached at server 1, which is the common case for 
frequent chains. 

Back-of-the-envelope calculation. We start with a
simple calculation. Let S be the size of the state sent
by the client for the chaining function at server 1. Then,
in terms of total latency, the RPC chain saves one network
latency but incurs S/link_bandwidth to send the
state. Thus, the RPC chain fares better as long as
link_latency > S/link_bandwidth, or


S < link_latency × link_bandwidth


For wide area links, the latency-bandwidth product 
can easily be in the tens to hundreds of kilobytes or more. 
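
The inequality above translates directly into a break-even check. The following sketch is illustrative only: the function names are hypothetical, and the example link (50 ms latency, 2 MB/s bandwidth) is an assumed value, not one of the measured links of Figure 7.

```python
def max_profitable_state_bytes(link_latency_s, link_bandwidth_bytes_per_s):
    """Latency-bandwidth product: the largest state worth sending."""
    return link_latency_s * link_bandwidth_bytes_per_s


def chain_is_faster(state_bytes, link_latency_s, link_bandwidth_bytes_per_s):
    """True when S < link_latency x link_bandwidth."""
    limit = max_profitable_state_bytes(link_latency_s,
                                       link_bandwidth_bytes_per_s)
    return state_bytes < limit


# Assumed wide-area link: 50 ms latency, 2 MB/s bandwidth.
limit = max_profitable_state_bytes(0.050, 2_000_000)  # about 100,000 bytes
```

Under these assumed values, a chain carrying 10 KB of state comes out ahead, while one carrying 1 MB does not.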

Experiment. We executed the RPC chain and the pair 
of RPC’s. The client was located in Redmond while the 
servers were in Mountain View. (Because both servers 
were in the same site, this setup favors the RPC chain by 
an additional network latency; we later explain the case 
when the servers are far apart.) 

Figure 10 shows the client end-to-end latency as a 
function of the state size (error bars show standard er- 
ror). For the standard RPC execution, state size does not 
affect total latency, since this state simply stays at the 
client. The total latency was 75±1 ms. For the RPC
chain, the latency naturally increases with the state size. 
The point at which both lines cross is at 150 KB. This 
is a fair amount of state to send in many cases—definitely 
much more than we needed in either of our applications. 

If servers 1 and 2 were far apart, this would shift the 
RPC chain line up by the corresponding extra latency. 
For example, if the latency from server 1 to server 2 were 
15 ms, the lines would cross at 100 KB (assuming the 
distance from client to server 2 remains the same), which 
is still a reasonable state size (and much more than we 
needed in our applications). 

Figure 10: Execution time using an RPC chain versus
standard RPC to call 2 servers. 


5.3 Storage application


We now evaluate the use of RPC chains for the storage 
application described in Section 4.1. 


5.3.1 Copy performance 


In our experiments, we copy data from one storage server 
to another using two utilities: one that uses RPC chains, 
called Chain copy, and another that uses standard RPC’s, 
called RPC copy. Both utilities use pipelining, so that 
the client has multiple outstanding requests on either 
server. We also tried using the operating-system-provided
“copy” program, but it performed much worse than
either Chain copy or RPC copy, because it reads and
writes one chunk of data at a time (no pipelining).

In our first experiment, a single client copies a file of 
variable size (25 KB, 100 KB, 250 KB, and 500 KB) 
between two servers, and we measure the time it takes. 
We vary the location of the client (Mt. View, Redmond, 
Beijing) and fix the location of the servers in Mt. View. 
In the setting where both the client and the servers were 
in Mt. View, we placed them in two separate subnets, 
where the ping latency between the two was 2 ms and 
TCP bandwidth was 10 MB/s. 

Figure 11 shows the results. Each bar represents the
median of 40 repetitions of the experiment. As we can
see, Chain copy provides considerable benefits in every
case, compared to RPC copy. The benefits are greater
for larger files and longer distances between client and
servers. In a local setting, the copying time is reduced
by up to a factor of 2, while in the longest-distance setting
(Beijing-Mt. View), the reduction is up to a factor of 5.

Another benefit of using Chain copy (not shown) is a
reduction by a factor of two in (a) the aggregate network
bandwidth consumption, and (b) the client bandwidth
consumption. This reduction is important because links
connecting data centers have limited bandwidth and/or
are priced based on the bandwidth used.

Figure 11: Comparison of RPC copy and Chain copy under various settings. (Left) Client and servers are in the same
site in Mt. View. (Center) Client is in Redmond and servers are in Mt. View. (Right) Client is in Beijing and servers
are in Mt. View.

Figure 12: Throughput-latency of RPC copy and Chain
copy. Latency is the time to copy a 128 KB file, and
throughput is the rate at which files are copied.


In our next experiment, we vary the number of clients 
simultaneously copying files from one server to another, 
and measure the resultant throughput and latency of the 
system. This allows us to observe the behavior of the 
system under varying load as well as measure the peak 
throughput of the system. As before, the client machine 
was located in Redmond and the servers were located in 
Mt. View. We ran multiple client instances in parallel on 
the client machine, each client copying 1000 files in suc- 
cession, each file measuring 256 KB in size. We measure 
the time that each client takes to complete copying 1000 
files, and compute conservative throughput and latency 
numbers based on the slowest client. 


Figure 12 shows the results of the experiment. For
both RPC copy and Chain copy, the average latency
increases as the workload placed on the system
increases. Initially, the increase in workload also
results in an increase in the aggregate throughput of the
system, but once the system becomes saturated, any
increase in workload only increases latency without any
gain in throughput. Our results show that RPC copy
is able to sustain a peak throughput of 4.5 MB/s. This 
peak throughput occurs when the network link between 
the client and the servers, which had a bandwidth of 6.3 
MB/s, becomes saturated. Since Chain copy does not 
require that the data blocks of the files being copied ac- 
tually flow through the client, it was not subject to this 
limitation and was thus able to achieve a higher peak 
throughput of 10.4 MB/s. Rather than a network band- 
width limitation, Chain copy’s throughput is limited by 
the servers’ ability to keep up with requests. 


5.3.2 Benefit of chain composition 


In this experiment, we measure the benefit of compos- 
ing RPC chains. We use two chains: one for copying 
from one server to another (as above) and the other for 
primary-backup replication of the second server (as in 
Figure 5). We compare two systems that use RPC chains; 
one system uses chain composition to combine the two 
chains, while the other has composition disabled. In the 
experiment, one client copies one file of variable size 
from the non-replicated server to the replicated server. 
The client is in Cambridge, the source server is in Mt. 
View, the primary of the destination server is in Mt. 
View, and the backup of the destination server is in Red- 
mond. 

Figure 13 shows the result. As we can see, composing
the chain reduces the duration of the copy by 12%-20%,
with larger files having a greater reduction. Without
composition, the destination server has to handle both
requests from the source server and replies from
the backup server. Composition reduces the load on the
destination server by allowing the backup server to send
replies directly to the client. In addition, composition
eliminates the unnecessary messages from the backup
server to the destination server, reducing
bandwidth consumption. The combination of these factors
allows composition to improve the overall performance of
the system. As file size increases, the setup cost becomes
relatively small compared to the actual cost of executing
the chains. This makes the impact of the more efficient
chain that resulted from composition more apparent.

Figure 13: Benefit of chain composition.

Figure 14: RPC chain in web mail application. [Plot of latency (s) against email size (2 KB, 8 KB, 32 KB, 128 KB, 512 KB) for standard RPC and RPC chain.]


5.4 Web mail application 


We now describe the evaluation of the web mail applica- 
tion presented in Section 4.2. In our experimental setup, 
we placed the client in Mountain View, the mail server 
and the authentication server in Redmond, and all other 
servers in Beijing. This setup emulates the case where 
a user from Asia travels to the US and wants to access 
web mail services that are hosted in Asia. Since the web 
mail provider may have servers deployed worldwide, the 
user can be directed to a mail server and an authentication
server (Redmond) that are close to his current location
(Mountain View). However, user-specific data is stored 
on servers close to the user’s normal location (Beijing), 
so the mail server has to fetch data from those machines. 

Specifically, after receiving a cookie from the client 
and verifying the client’s identity, the mail server must 
fetch the client’s email from the storage server followed 
by appropriate ads from the ad server, both of which are 
located in Beijing. A traditional system implemented using
RPC’s would have the mail server contact the storage
server, fetch the user’s emails, then contact the ad
server to retrieve relevant ads. However, in our setting, 
where the mail server is located close to the client but 
far away from the storage server and ad server, travers- 
ing the long links between Redmond and Beijing four 
times would be less than ideal. As described in Sec- 
tion 4.2, RPC chains allow us to eliminate unnecessary 
network traversals. In this case, our RPC-chain-enabled 
mail server sends emails directly from the storage server 
to the ad server before returning the result to the mail 
server, halving the number of long link traversals. 
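
The difference between the two communication patterns can be checked by counting how many hops cross the long Redmond-Beijing link. The sketch below encodes the server placement of this experiment; the short server names and the helper `long_traversals` are hypothetical labels chosen for illustration.

```python
# Placement used in this experiment: mail server in Redmond,
# storage and ad servers in Beijing.
SITE = {"mail": "Redmond", "storage": "Beijing", "ads": "Beijing"}


def long_traversals(path):
    """Count consecutive hops in `path` whose endpoints are at different sites."""
    return sum(1 for a, b in zip(path, path[1:]) if SITE[a] != SITE[b])


# Standard RPC's: the mail server calls storage, then calls the ad server.
rpc_path = ["mail", "storage", "mail", "ads", "mail"]

# RPC chain: storage forwards the emails directly to the ad server.
chain_path = ["mail", "storage", "ads", "mail"]
```

Here `long_traversals(rpc_path)` is 4 and `long_traversals(chain_path)` is 2, matching the halving described above; the storage-to-ad hop stays inside Beijing and is not counted.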


We measure the client-perceived latency of opening an
inbox and retrieving one email: the client first contacts 
the front end authentication server to authenticate her- 
self, then she sends a read request to the mail server to 
retrieve a single email. We measure the time it takes for 
the client to receive the email, which is appended with 
an ad whose size is small relative to the size of the email. 
We vary the size of the email that is fetched, and for each 
size, we repeated the experiment 20 times. 


As shown in Figure 14, RPC chains consistently reduce
the client-perceived latency of the web mail application.
As the size of the email increases, the latency
improvement from using RPC chains also increases.
Overall, we found that the use of RPC chains reduced
the latency of the web mail application by 40% to 58%
when compared to standard RPC’s.


We note that the significant performance gains of using
RPC chains come at a very low cost of implementation.
For the web mail application, the effort involved in
enabling RPC chains was mainly in implementing
chaining functions, which totaled a mere 48 lines of
C# code. In general, a simple way for existing applica- 
tions to benefit from RPC chains is to identify the critical 
causal path of RPC requests, and replace that path with 
an RPC chain. The effort is that of writing a single RPC 
chain; in the worst case, one can do it from scratch. The 
harder problem is finding the critical causal path, which 
has been an active area of research (e.g., [1]). 


6 Limitations 


We now describe some limitations of RPC chains. 


Chaining state cannot always be sent. RPC chains 
are not appropriate if the chaining state is large or if it 
cannot be determined when the client starts the chain. 
For example, suppose that (1) A calls B using an RPC, 
(2) A gets a reply, and (3) depending on the state of a 
sensor or some immediate measurement at A, A then 
calls C or D. It is not possible to use an RPC chain 
A→B→(C or D), because the choice of going to C versus
D must be made at A where the sensor is.

Programming with continuations. To use RPC
chains, developers need to make use of continuation- 
style programming. This can be much harder than pro- 
gramming using sequential code, because continuations 
must explicitly keep track of all their state. Continuations 
are notoriously hard to debug, because there is no simple 
way to track the execution that led to a given state. 

We note, however, that programming with continua- 
tions is already tolerated in code that uses asynchronous 
RPC’s and callbacks. Moreover, one could perhaps write 
a tool that automatically produces continuations from se- 
quential code, using techniques from the compiler litera- 
ture (see, e.g., [3]). 

Terminating chains. When an application terminates, 
it is usually desirable to release its resources and halt all 
its activities. However, if the application has outstanding 
RPC chains, it is not easy to terminate them. This prob- 
lem exists with traditional RPC’s as well (there is no easy 
way to terminate a remote procedure), but it is worse with 
RPC chains because the remote servers involved may not 
be known. 

RPC chains are designed for relatively short-lived
executions, and for these uses, this problem is less of a
concern, because a chain soon terminates anyway. The only
exception is a buggy chain that runs forever. For such 
chains, the RPCC library can impose a maximum chain 
length, say 2000 hops, and throw an exception after that. 
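
A hop limit of this kind amounts to a counter checked before each hop. The following is a minimal sketch, not the RPCC library's mechanism: `run_chain`, `step`, and `ChainTooLongError` are hypothetical names, and a chain is modeled simply as a function that maps the current state to the next state, or to None when the chain ends.

```python
MAX_HOPS = 2000  # the example limit suggested in the text


class ChainTooLongError(RuntimeError):
    """Raised when a chain exceeds the maximum number of hops."""


def run_chain(step, state, max_hops=MAX_HOPS):
    """Apply `step` repeatedly until it returns None; return the hop count."""
    hops = 0
    while state is not None:
        if hops >= max_hops:
            raise ChainTooLongError(f"chain exceeded {max_hops} hops")
        state = step(state)
        hops += 1
    return hops
```

A well-behaved chain returns its hop count, while a buggy chain whose step never returns None is cut off with an exception instead of running forever.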


7 Extensions 


We now discuss some extensions of RPC chains. 


7.1 Intermediate chain results 


If a client wants to receive some results from inter- 
mediate servers of the chain, these results need to be 
relayed through the chain. If the amount of data is 
large, it can impose a significant overhead. We can ex- 
tend RPC chains to address this issue, by allowing each 
server in the chain to directly return some data to the 
client. This data is application-specific and is returned 
by the chaining function. Thus, we add a new field, 
client-response, to the nexthop result of a chaining 
function. The RPCC library sends client-response to 
the client concurrently with continuing the chain. 
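
The extension can be sketched as one extra optional field on the value a chaining function returns. The names below (`NextHop`, `dispatch`, and the two callbacks) are hypothetical and only approximate the nexthop result described in the text; for simplicity the sketch performs the two sends one after the other, whereas the text has them proceed concurrently.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass
class NextHop:
    server: Optional[str]         # next server in the chain; None ends it
    args: Any = None              # parameters for the next service function
    client_response: Any = None   # new field: data returned straight to client


def dispatch(result: NextHop,
             send_to_client: Callable[[Any], None],
             forward: Callable[[str, Any], None]) -> None:
    """Send any intermediate result to the client, then continue the chain."""
    if result.client_response is not None:
        send_to_client(result.client_response)
    if result.server is not None:
        forward(result.server, result.args)
```

With this shape, a server that has a large intermediate result hands it to `send_to_client` once, and only the (small) arguments travel on to the next server.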

What happens under chain composition? In this case,
the “client” that gets client-response is the server that
created the sub-chain. The names of these creators, at
each level of the composed chain, are kept in the chain stack
(the chain stack is explained in Section 3.4).


7.2 Dealing with large chaining states 


The chaining state is the state that the client sends along 
the chain to execute the chaining functions. If this state 
is large, this can incur a significant state overhead. Two
optimizations are possible to mitigate this cost. 

Fall-back to standard RPC. As explained in Sec- 
tion 3.7, we can execute a chain in interactive mode, 
which causes the chain to go back to the client at ev- 
ery step. This is effectively a fall-back to standard RPC, 
causing all chaining functions to execute at the client, 
which eliminates the overhead of sending the chaining 
state, at the cost of extra network delays. We explored 
this trade-off in Section 5.2.2. It is possible to have the 
RPCC library gauge the size of the chaining state before 
starting the chain, and if the state is larger than some 
threshold, execute the chain in interactive mode. The 
threshold can be chosen dynamically based on previous 
executions of the same chain, in an adaptive manner. By 
doing so, an RPC chain will always perform at least as 
well as standard RPC’s, modulo the small computational 
overhead of executing chaining functions and the time it 
takes to adapt. However, in the applications we examined 
in this paper, we did not need this technique because the 
chaining state was always small. 
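
One possible shape for such an adaptive policy is sketched below. It is an assumption-laden illustration, not something we implemented: the class `AdaptiveModeChooser` is hypothetical, and it assumes that latencies for both modes are available for the same chain (for example, from occasional probe executions), which a real system would have to arrange.

```python
class AdaptiveModeChooser:
    """Chooses between chain mode and interactive (standard RPC) mode."""

    def __init__(self, initial_threshold_bytes):
        self.threshold = initial_threshold_bytes

    def choose(self, state_bytes):
        """Use interactive mode only when the state exceeds the threshold."""
        return "interactive" if state_bytes > self.threshold else "chain"

    def record(self, state_bytes, chain_ms, interactive_ms):
        """Adjust the threshold after observing both modes at one state size."""
        if chain_ms > interactive_ms and state_bytes <= self.threshold:
            # The chain lost at a size we would have chained: lower the bar.
            self.threshold = state_bytes - 1
        elif chain_ms <= interactive_ms and state_bytes > self.threshold:
            # The chain won at a size we would have rejected: raise the bar.
            self.threshold = state_bytes
```

The policy is deliberately crude: it moves the threshold to the most recent contradicting observation, so a noisy measurement can shift it; smoothing over several executions would be a natural refinement.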

Hiding latency. In our implementation, servers wait 
to receive the chaining state before executing the next 
service function in the chain. This waiting is not nec- 
essary, because the service function depends only on its 
parameters, not on the chaining state (the chaining state 
is only needed for the chaining function, which executes 
later). Therefore, a natural optimization is to start the ser- 
vice function even as the chaining state is being received. 
If the service function takes significant time to complete
(e.g., it involves disk I/O or some lengthy computation), 
this will mask part or all of the latency of transmitting 
the chaining state. 
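
The overlap can be sketched with a worker thread: the service function starts immediately, while the caller blocks on receiving the chaining state. This is only an illustration of the idea; `handle_hop` and its parameters are hypothetical names, and `receive_state` stands in for whatever network read delivers the chaining state.

```python
from concurrent.futures import ThreadPoolExecutor


def handle_hop(service_fn, params, receive_state, chaining_fn):
    """Run the service function while the chaining state is still arriving."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        # The service function depends only on its parameters, so start it now.
        pending = pool.submit(service_fn, *params)
        # Meanwhile, block until the chaining state has been received.
        state = receive_state()
        # Wait for the service result (it may already be done).
        result = pending.result()
    # The chaining function is the only consumer of the chaining state.
    return chaining_fn(result, state)
```

If the service function is slow (disk I/O, lengthy computation), the time spent in `receive_state` is hidden behind it; if it is fast, nothing is lost compared to waiting for the state first.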


7.3 Chaining proxy


As we said, chaining functions are portable code that does
not have to execute at the server. They can execute at a
designated chaining proxy machine, to avoid any over- 
head at the server. Doing so incurs extra communication, 
but if the chaining proxy is geographically close to the 
server, this cost is small relative to that of a wide-area 
hop. To choose the chaining proxy, we can use a simple 
mapping from servers to nearby proxies configured by an 
administrator. 


8 Related work


RPC chains utilize two well-understood ideas in the con- 
text of remote execution: function shipping, and contin- 
uations. 

Function shipping is the general technique of sending 
computation to the data rather than bringing the data to 
the computation. It is used in some systems where the 
cost of moving data is large compared to the cost of moving
computation. For example, Diamond [10] is a storage
architecture in which applications download searchlet
code to disk to perform efficient filtering of large data
sets locally, thereby improving efficiency. RPC chains 
use function shipping to send chaining logic. 


A continuation [17] refers to the shifting of program 
control and transfer of current state from one part of a 
program to another. Extending this to distributed contin- 
uations is a natural step, allowing a continuation to shift 
program control from one processor to another. Several 
works in the parallel programming community give high- 
level programming continuation constructs and specify 
their behavior formally, e.g., [12, 11]. Distributed con- 
tinuations were exploited to enhance the functionality of 
web servers and overcome the stateless nature of HTTP 
interaction. By comparison, the RPC chain is a generic 
mechanism that is independent of the service provided 
by servers. RPC chains support complex chaining struc- 
tures, and can be used with a diverse set of servers. 


The above-mentioned ideas for code mobility, and others,
are leveraged in a variety of high-level programming
paradigms for distributed execution. Distributed
workflows, e.g., [5, 22], can use distributed continuations
to distribute a workflow description in a decentralized
fashion. MapReduce [6] and Dryad [23] are
programming models for data-parallel jobs, such as
data mining calculations, which process large amounts of
data in batches. These systems target self-contained jobs 
that execute for substantial periods, while RPC Chains 
are intended for short-lived remote executions in an en- 
vironment with many diverse services that are possi- 
bly developed independently of their applications. Mo- 
bile agents have been extensively studied in the litera- 
ture and many systems have been built, including Tele- 
script/Odyssey [19], Aglets [4], D’Agents [8], and others 
(see e.g., [20, 9]). A mobile agent is a process that can 
autonomously migrate itself from host to host as it ex- 
ecutes; migration involves moving the process’s current 
state to the new host and resuming execution. The motivations
for mobile agents include (a) bringing processes
closer to the resources they need in a given stage of the 
computation, and (b) allowing clients to disconnect from 
the network while an agent executes on their behalf. An 
RPC Chain can be considered as a mobile agent whose 
purpose is to execute a series of RPC calls. However, 
mobile agents are much more general and ambitious than 
RPC Chains (which possibly contributed to their even- 
tual demise): they have social abilities, being able to ad- 
just their behavior according to the host in which they are 
currently executing; they can learn about execution envi- 
ronments never envisioned by their creators; and they can 
persist if the clients that created them disappear. Much 
of the literature regarding mobile agents is about security
(how agents can survive malicious hosts, and how hosts
can protect themselves against malicious agents) and lan- 
guage support for code mobility (how to write programs 
that can transparently move to other machines). For RPC 
chains, security is a smaller concern in the trusted data 
center and enterprise environments that we consider, and 
we are not concerned about transparent mobility. 

Some related work includes more targeted uses of mo- 
bile code. Work on Active Networks introduced network 
packets called capsules, which carry code that network 
switches execute to route the packet (see [18] for a sur- 
vey). This provides a general scheme for extending net- 
work protocols beyond the existing deployed base, and 
allows for more dynamic routing schemes. In contrast, 
RPC chains are aimed at higher-level applications, and 
their main purpose is to eliminate communication hops 
when a client needs to call many services in succession. 

Distributed Hash Tables (e.g., Chord [16], CAN [14], 
Pastry [15], Tapestry [24]) have a lookup protocol, for 
finding the host responsible for a given key. Such proto- 
cols generally need to contact several hosts successively, 
and this can be done in two ways. In an interactive 
lookup protocol, the host that initiates the lookup opera- 
tion issues RPC’s to each host in succession. A recursive 
lookup protocol [7] works like a routing protocol: the 
host that initiates the operation contacts the first host in 
the sequence, which in turn contacts the next one, and so 
forth; when a host finds the key, it contacts the request 
initiator directly. This protocol is hard-coded into the 
lookup operation, and it is executed by a set of servers 
that implement this operation. In contrast, RPC chains 
provide a generic chaining mechanism that is indepen- 
dent of the operation (service function) executed. 
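
The difference between the two lookup styles can be made concrete by counting messages. The following Python sketch is illustrative only: the host names, the `path` list, and both function names are assumptions introduced here rather than part of any DHT implementation, and it assumes the lookup visits at least one host.

```python
from typing import List, Tuple

Message = Tuple[str, str]  # (sender, receiver)


def initiator_driven_lookup(initiator: str, path: List[str]) -> List[Message]:
    """The initiator issues an RPC to each host on the path in succession."""
    messages: List[Message] = []
    for host in path:
        messages.append((initiator, host))  # request
        messages.append((host, initiator))  # reply naming the next host, or the result
    return messages


def recursive_lookup(initiator: str, path: List[str]) -> List[Message]:
    """Each host forwards the request; the last host replies to the initiator."""
    messages: List[Message] = []
    sender = initiator
    for host in path:
        messages.append((sender, host))  # forward to the next host
        sender = host
    messages.append((sender, initiator))  # direct reply from the host that found the key
    return messages


path = ["h1", "h2", "h3"]  # hypothetical hosts visited by one lookup
print(len(initiator_driven_lookup("client", path)))  # 6 messages
print(len(recursive_lookup("client", path)))         # 4 messages
```

For a path of n hosts, the first style exchanges 2n messages and the second n + 1. RPC chains aim for a similar saving in hops, but for arbitrary service functions rather than one hard-coded operation.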

Finally, SOAP [21] is a protocol that supports RPC’s 
using XML over HTTP. It has the notion of intermedi- 
aries that can process a SOAP message (RPC) before it 
reaches the final recipient. However, there is no client 
logic that routes and transforms messages, and the notion 
of a pre-specified distinguished final recipient is inher- 
ent to SOAP. Typical uses for intermediary nodes include 
blocking messages (firewall), buffering and batching of 
messages, tracing, and encrypting/decrypting messages 
as they pass through an untrusted domain. 


9 Conclusion 


We proposed the RPC chain, a simple but powerful 
primitive that combines multiple RPC invocations into 
a chain, in order to optimize the communication pattern 
of applications that use many composite services, possi- 
bly developed independently of each other. With RPC 
chains, clients can save network hops, resulting in con- 
siderably smaller end-to-end latencies in a geodistributed 
setting. Clients can also save bandwidth because they are 
not forced to receive data they do not need. We demon- 
strated the use of RPC chains for a storage and a web 
application, and we think RPC chains could have many 
more applications beyond those. 


Abstract 


In this paper we present Botlab, a platform that con- 
tinually monitors and analyzes the behavior of spam- 
oriented botnets. Botlab gathers multiple real-time 
streams of information about botnets taken from distinct 
perspectives. By combining and analyzing these streams, 
Botlab can produce accurate, timely, and comprehensive 
data about spam botnet behavior. Our prototype system 
integrates information about spam arriving at the Univer- 
sity of Washington, outgoing spam generated by captive 
botnet nodes, and information gleaned from DNS about 
URLs found within these spam messages. 

We describe the design and implementation of Botlab, 
including the challenges we had to overcome, such as 
preventing captive nodes from causing harm or thwart- 
ing virtual machine detection. Next, we present the re- 
sults of a detailed measurement study of the behavior of 
the most active spam botnets. We find that six botnets 
are responsible for 79% of spam messages arriving at the 
UW campus. Finally, we present defensive tools that take 
advantage of the Botlab platform to improve spam filter- 
ing and protect users from harmful web sites advertised 
within botnet-generated spam. 


1 Introduction 


Spamming botnets are a blight on the Internet. By some 
estimates, they transmit approximately 85% of the 100+ 
billion spam messages sent per day [14, 21]. Botnet- 
generated spam is a nuisance to users, but worse, it can 
cause significant harm when used to propagate phishing 
campaigns that steal identities, or to distribute malware 
to compromise more hosts. 

These concerns have prompted academia and industry 
to analyze spam and spamming botnets. Previous stud- 
ies have examined spam received by sinkholes and pop- 
ular web-based mail services to derive spam signatures, 
determine properties of spam campaigns, and character- 
ize scam hosting infrastructure [1, 39, 40]. This analysis 
of “incoming” spam feeds provides valuable information 
on aggregate botnet behavior, but it does not separate ac- 
tivities of individual botnets or provide information on 
the spammers’ latest techniques. Other efforts reverse 
engineered and infiltrated individual spamming botnets, 
including Storm [20] and Rustock [5]. However, these 
techniques are specific to these botnets and their com- 
munication methods, and their analysis only considers 
characteristics of the “outgoing” spam these botnets gen- 
erate. Passive honeynets [13, 27, 41] are becoming less 
applicable to this problem over time, as botnets are in- 
creasingly propagating via social engineering and web- 
based drive-by download attacks that honeynets will not 
observe. Overall, there is still opportunity to design de- 
fensive tools to filter botnet spam, identify and block 
botnet-hosted malicious sites, and pinpoint which hosts 
are currently participating in a spamming botnet. 


In this paper we turn the tables on spam botnets by us- 
ing the vast quantities of spam that they generate to mon- 
itor and analyze their behavior. To do this, we designed 
and implemented Botlab, a continuously operating bot- 
net monitoring platform that provides real-time informa- 
tion regarding botnet activity. Botlab consumes a feed of 
all incoming spam arriving at the University of Washing- 
ton, allowing it to find fresh botnet binaries propagated 
through spam links. It then executes multiple captive, 
sandboxed nodes from various botnets, allowing it to ob- 
serve the precise outgoing spam feeds from these nodes. 
It scours the spam feeds for URLs, gathers information 
on scams, and identifies exploit links. Finally, it corre- 
lates the incoming and outgoing spam feeds to identify 
the most active botnets and the set of compromised hosts 
comprising each botnet. 

A key insight behind Botlab is that the combination of 
both incoming and outgoing spam sources is essential for 
enabling a comprehensive, accurate, and timely analysis 
of botnet behavior. Incoming spam bootstraps the pro- 
cess of identifying spamming bots, outgoing spam en- 
ables us to track the ebbs and flows of botnets’ ongoing 
spam campaigns and establish the ground truth regard- 
ing spam templates, and correlation of the two feeds can 
classify incoming spam according to the botnet that is sourc- 
ing it, determine the number of hosts active within each 
botnet, and identify many of these botnet-infected hosts. 


1.1 Contributions 


Our work offers four novel contributions. First, we tackle 
many of the challenges involved in building a real-time 
botnet monitoring platform, including identifying and 
incorporating new bot variants, and preventing Botlab 
hosts from being blacklisted by botnet operators. 
Second, we have designed network sandboxing mech- 
anisms that prevent captive bot nodes from causing harm, 
while still enabling our research to be effective. As well, 
we discuss the long-term tension between effectiveness 
and safety in botnet research given botnets’ trends, and 
we present thought experiments that suggest that a de- 
termined adversary could make it extremely difficult to 
conduct future botnet research in a safe manner. 

Third, we present interesting behavioral character- 
istics of spamming botnets derived from our multi- 
perspective analysis. For example, we show that just 
a handful of botnets are responsible for most spam re- 
ceived by UW, and attribute incoming spam to specific 
botnets. As well, we show that the bots we analyze use 
simple methods for locating their command and control 
(C&C) servers; if these servers were efficiently located 
and shut down, much of today’s spam flow would be dis- 
rupted. As another example, in contrast to earlier find- 
ings [40], we observe that some spam campaigns utilize 
multiple botnets. 

Fourth, we have implemented several prototype de- 
fensive tools that take advantage of the real-time in- 
formation provided by the Botlab platform. We have 
constructed a Firefox plugin that protects users from 
scam and phishing web sites propagated by spam bot- 
nets. The plugin blocked 40,270 malicious links em- 
anating from one botnet monitored by Botlab; in con- 
trast, two blacklist-based defenses failed to detect any of 
these links. As well, we have designed and implemented 
a Thunderbird plugin that filters botnet-generated spam. 
For one user, the plugin reduced the amount of spam that 
bypassed his SpamAssassin filters by 76%. 

The rest of this paper is organized as follows. Sec- 
tion 2 provides background material on the botnet threat. 
Section 3 discusses the design and implementation of 
Botlab. We evaluate Botlab in Section 4 and describe ap- 
plications we have built using it in Section 5. We discuss 
our thoughts on the long-term viability of safe botnet re- 
search in Section 6. We present related work in Section 7 
and conclude in Section 8. 


2 Background on the Botnet Threat 


A botnet is a large-scale, coordinated network of comput- 
ers, each of which executes specific bot software. Botnet 
operators recruit new nodes by commandeering victim 
hosts and surreptitiously installing bot code onto them; 
the resulting army of “zombie” computers is typically 
controlled by one or more command-and-control (C&C) 
servers. Botnet operators employ their botnets to send 
spam, scan for new victims, steal confidential informa- 
tion from users, perform DDoS attacks, host web servers 
and phishing content, and propagate updates to the bot- 
net software itself. 

Botnets originated as simple extensions to existing In- 
ternet Relay Chat (IRC) softbots. Efforts to combat bot- 
nets have grown, but so has the demand for their services. 
In response, botnets have become more sophisticated and 
complex in how they recruit new victims and mask their 
presence from detection systems: 
Propagation: Malware authors are increasingly relying 
on social engineering to find and compromise victims, 
such as by spamming users with personal greeting card 
ads or false upgrade notices that entice them to install 
malware. As propagation techniques move up the proto- 
col stacks, the weakest link in the botnet defense chain 
becomes the human user. As well, systems such as pas- 
sive honeynets become less effective at detecting new 
botnet software, instead requiring active steps to gather 
and classify potential malware. 


Customized C&C protocols: While many of the older 
botnet designs used IRC to communicate with C&C 
servers, newer botnets use encrypted and customized 
protocols for disseminating commands and directing 
bots [7, 9, 33, 36]. For example, some botnets communi- 
cate via HTTP requests and responses carrying encrypted 
C&C data. Manual reverse-engineering of bot behavior 
has thus become time-consuming if not impossible. 


Rapid evolution: To evade detection from trackers 
and anti-malware software, some newer botnets morph 
rapidly. For instance, most malware binaries are 
packed using polymorphic packers that generate differ- 
ent looking binaries even though the underlying code 
base has not changed [29]. Also, botnet operators are 
moving away from relying on a single web server to host 
their scams, and instead are using fast flux DNS [12]. 
In this scheme, attackers rapidly rebind the server DNS 
name to different botnet IP addresses, in order to defend 
against IP blacklisting or manual server take-down. Fi- 
nally, botnets also make updates to their C&C protocols, 
by incorporating new forms of encryption and command 
distribution. 

Moving forward, analysis and defense systems must 
contend with the increasing sophistication of botnets. 
Monitoring systems must be pro-active in collecting and 
executing botnet samples, as botnets and their behavior 
change rapidly. As well, botnet analysis systems will in- 
creasingly have to rely on external observations of botnet 
behavior, rather than necessarily being able to crack and 
reverse engineer botnet control traffic. 


3 The Botlab Monitoring Platform 


The Botlab platform produces fresh information about 
spam-oriented botnets, including their current cam- 
paigns, constituent bots, and C&C servers. Botlab par- 
tially automates many aspects of botnet monitoring, re- 
ducing but not eliminating the manual effort required of 
a human operator to analyze new bot binaries and incor- 
porate them into the Botlab platform. 
Botlab’s design was motivated by four requirements: 


1. Attribution: Botlab must identify the spam botnets 
that are responsible for campaigns and the hosts that 
belong to those botnets. 


2. Adaptation: Botlab must track changes in the bot- 
nets’ behavior over time. 


3. Immediacy: Because the value of information about 
botnet behavior degrades quickly, Botlab must pro- 
duce information on-the-fly. 


4. Safety: Botlab must not cause harm. 


Figure 1: Botlab Architecture. Botlab coordinates and monitors multiple sources of data about spam botnets, including incoming 
spam from the University of Washington, and outgoing spam generated by captive bot nodes. 


There is a key tension in our work between safety and 
effectiveness, similar to the tradeoff between safety and fi- 
delity identified in the Potemkin honeyfarm [34]. In Sec- 
tion 6, we discuss this tension in more detail and com- 
ment on the long-term viability of safe botnet research. 

Figure 1 shows the Botlab architecture. We now de- 
scribe Botlab’s main components and techniques. 


3.1 Incoming Spam 


Botlab monitors a live feed of spam received by approx- 
imately 200,000 University of Washington e-mail ad- 
dresses. On average, UW receives 2.5 million e-mail 
messages each day, over 90% of which are classified as 
spam. We use this spam feed both to collect new malware 
binaries, described next, and as input to Botlab’s correlation 
engine, described in Section 3.5. 


3.2 Malware Collection 


Running captive bot nodes requires up-to-date bot bi- 
naries. Botlab obtains these in two ways. First, many 
botnets spread by emailing malicious links to victims; 
accordingly, Botlab crawls URLs found in its incom- 
ing spam feed. We typically find approximately 100,000 
unique URLs per day in our spam feed, 1% of which 
point to malicious executables or drive-by downloads. 
Second, Botlab periodically crawls binaries or URLs 
contained in public malware repositories [3, 25] or col- 
lected by the MWCollect Alliance honeypots [22]. 


Given these binaries, a human operator then uses Bot- 
lab’s automated tools for malware analysis and finger- 
printing to find bot binaries that actively send spam, as 
discussed next in Section 3.3. Our experience to date 
has yielded two interesting observations. First, though 
the honeypots produced about 2,000 unique binaries over 
a two month period, none of these binaries were spam- 
ming bots. A significant fraction of the honeypot binaries 
were traditional IRC-based bots, whereas the spamming 
binaries we identified from other sources all used non- 
IRC protocols. This suggests that spamming bots prop- 
agate through social engineering techniques, rather than 
the automated compromise of remote hosts. 


Second, many of the malicious URLs seen in spam 
point to legitimate web servers that have been hacked 
to provide malware hosting. Since malicious pages are 
typically not linked from the legitimate pages on these 
web servers, an ordinary web crawl will not find them. 
This undermines the effectiveness of identifying mali- 
cious pages using exhaustive crawls, a hypothesis that 
is supported by our measurements in Section 5. 


3.3 Identifying Spamming Bots 


Botlab executes spamming bots within sandboxes to 
monitor botnet behavior. However, we must first prune 
the binaries obtained by Botlab to identify those that cor- 
respond to spamming bots and to discard any duplicate 
binaries already being monitored by Botlab. 

Simple hashing is insufficient to find all duplicates, 
as malware authors frequently repack binaries or release 
slightly modified versions to circumvent signature-based 
security tools. Relying on anti-virus software is also im- 
practical, as these tools do not detect many new malware 
variants. 

To obtain a more reliable behavioral signature, Bot- 
lab produces a network fingerprint for each binary it 
considers. A network fingerprint captures informa- 
tion about the network connections initiated by a bi- 
nary. To obtain it, we execute each binary in a safe 
sandbox and log all outbound network connection at- 
tempts. A network fingerprint will then consist of 
a set of flow records of the form <protocol, IP 
address, DNS address, port>. Note that the 
DNS address field might be blank if a bot communicates 
with an IP directly, instead of doing a DNS lookup. 

Once network activity is logged, we extract the flow 
records. We execute each binary two times and take the 
network fingerprint to be the set of flow records which 
are common across both executions. This eliminates 
any random connections which do not constitute stable 
behavioral attributes. For example, some binaries har- 
vest e-mail addresses and spam subjects by searching 
google.com for random search terms, and following 
links to the highest-ranked search results; repeated ex- 
ecution identifies and discards these essentially random 
connection attempts. 
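
As a rough illustration of the procedure just described, the Python sketch below models a flow record as a four-field tuple and takes the fingerprint to be the records common to two executions. The class name, function name, and all addresses (drawn from documentation-reserved ranges) are assumptions made for this example, not Botlab's actual code.

```python
from typing import FrozenSet, Iterable, NamedTuple, Optional


class FlowRecord(NamedTuple):
    protocol: str
    ip_address: str
    dns_address: Optional[str]  # None when the bot contacts an IP directly
    port: int


def network_fingerprint(first_run: Iterable[FlowRecord],
                        second_run: Iterable[FlowRecord]) -> FrozenSet[FlowRecord]:
    """Keep only the flow records observed in both executions."""
    return frozenset(first_run) & frozenset(second_run)


# Hypothetical connection logs from two executions of the same binary.
cc = FlowRecord("tcp", "192.0.2.10", "cc.example.net", 80)
smtp = FlowRecord("tcp", "198.51.100.25", None, 25)
noise_a = FlowRecord("tcp", "203.0.113.7", "search.example.com", 80)
noise_b = FlowRecord("tcp", "203.0.113.99", "other.example.org", 80)

fingerprint = network_fingerprint([cc, smtp, noise_a], [smtp, noise_b, cc])
print(fingerprint == {cc, smtp})  # True: the one-off connections are discarded
```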

Given the network fingerprints $N_1$ and $N_2$ of two bi- 
naries $B_1$ and $B_2$ respectively, we define the similarity 
coefficient of the binaries, $S(B_1, B_2)$, to be: 

$$S(B_1, B_2) = \frac{|N_1 \cap N_2|}{|N_1 \cup N_2|}$$

If the similarity coefficient of two binaries is sufficiently 
high (we use 0.5 as the threshold), we consider the bina- 
ries to be behavioral duplicates. As well, binaries which 
attempt to send e-mail are classified as spamming bots. 
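
The similarity coefficient is the ratio of flow records shared by the two fingerprints to the flow records present in either one (the Jaccard index). A minimal Python sketch follows; the function names, the treatment of two empty fingerprints, and the reading of "sufficiently high" as greater than or equal to the threshold are assumptions, and the flow records are made up.

```python
from typing import AbstractSet, Hashable

DUPLICATE_THRESHOLD = 0.5  # threshold quoted in the text


def similarity(n1: AbstractSet[Hashable], n2: AbstractSet[Hashable]) -> float:
    """Size of the intersection divided by size of the union of two fingerprints."""
    union = n1 | n2
    if not union:
        return 0.0  # assumption: two empty fingerprints give no evidence of similarity
    return len(n1 & n2) / len(union)


def behavioral_duplicates(n1: AbstractSet[Hashable], n2: AbstractSet[Hashable],
                          threshold: float = DUPLICATE_THRESHOLD) -> bool:
    # Assumption: "sufficiently high" is read as >= threshold.
    return similarity(n1, n2) >= threshold


# Flow records as (protocol, IP address, DNS address, port); all values are hypothetical.
a = {("tcp", "192.0.2.10", "cc.example.net", 80),
     ("tcp", "198.51.100.25", "", 25),
     ("udp", "192.0.2.53", "", 53)}
b = {("tcp", "192.0.2.10", "cc.example.net", 80),
     ("tcp", "198.51.100.25", "", 25),
     ("tcp", "203.0.113.7", "", 8080)}

print(similarity(a, b))             # 2 shared records / 4 distinct records = 0.5
print(behavioral_duplicates(a, b))  # True under the >= reading
```
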
We took a step to validate our duplicate elimination 
procedure. Unfortunately, given a pair of arbitrary bi- 
naries, determining that they are behavioral duplicates is 
undecidable, so we must rely on an approximation. For 
this, we used five commercial anti-virus tools and a set 
of 500 malicious binaries which made network connec- 
tions. All five anti-virus tools had signatures for only 192 
of the 500 binaries, and we used only these 192 binaries 
in our validation. We considered a pair of binaries to be 
duplicates if their anti-virus tags matched in the majority 
of the five tools. Note that we do not expect the tags to be 
identical across different anti-virus tools. Network fin- 
gerprinting matched this tag-based classification 98% of 
the time, giving us reasonable confidence in its ability 
to detect duplicates. Also, we observed a false-positive 
rate of 0.62%, where the anti-virus tags did not match, 
but network fingerprinting labeled the files as duplicates. 
Note again that anti-virus tools lack signatures for many 
new binaries our crawler analyzes, making them unfit to 
use as our main duplicate suppression method. 
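
The tag-based reference classification used for this validation can be sketched as a majority vote. The Python below is only an illustration: the function name and the tag strings are invented, and it assumes each binary has exactly one tag per tool, listed in the same tool order.

```python
from typing import Sequence


def av_tags_agree(tags_a: Sequence[str], tags_b: Sequence[str]) -> bool:
    """True when a strict majority of tools assign the same tag to both binaries."""
    if len(tags_a) != len(tags_b):
        raise ValueError("expected one tag per tool for each binary")
    matches = sum(1 for x, y in zip(tags_a, tags_b) if x == y)
    return matches > len(tags_a) / 2


# One tag per anti-virus tool; the tag strings are hypothetical.
binary_1 = ["Trojan.A", "W32/Bot.x", "Spambot-1", "Mal/Gen", "Bot.Q"]
binary_2 = ["Trojan.A", "W32/Bot.x", "Spambot-1", "Mal/Other", "Bot.R"]
print(av_tags_agree(binary_1, binary_2))  # True: 3 of 5 tools agree
```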


3.3.1 Safely generating fingerprints 


The tension between safety and effectiveness is particu- 
larly evident when constructing signatures of newly gath- 
ered binaries. A safe approach would log emitted net- 
work packets, but drop them instead of transmitting them 
externally; unfortunately, this approach is ineffective, 
since many binaries must first communicate with a C&C 
server or successfully transmit probe email messages be- 
fore fully activating. An effective approach would al- 
low a binary unfettered access to the Internet; unfortu- 
nately, this would be unsafe, as malicious binaries may 
perform DoS attacks, probe or exploit remote vulnerabil- 
ities, transmit spam, or relay botnet control traffic. 

Botlab attempts to walk the tightrope between safety 
and effectiveness. We provide a human operator with 
tools that act as a safety net: traffic destined to privileged 
ports, or ports associated with known vulnerabilities, is 
automatically dropped, and limits are enforced on con- 
nection rates, data transmission, and the total window of 
time in which we allow a binary to execute. As well, Bot- 
lab provides operators with the ability to redirect outgo- 
ing SMTP traffic to spamhole, an emulated SMTP server 
that traps messages while fooling the sender into believ- 
ing the message was sent successfully. 
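
One way to picture these safeguards is as a per-connection decision function. The Python sketch below is a simplification under stated assumptions: the class and field names, the example port numbers, the connection limit, the operator allow-list, and the order in which the rules are checked are all hypothetical, and the limits on data volume and execution time are omitted.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    ALLOW = "allow"
    DROP = "drop"
    SPAMHOLE = "redirect to spamhole"


SMTP_PORT = 25


@dataclass
class SandboxPolicy:
    # All defaults are hypothetical; the text gives no concrete values.
    vulnerable_ports: frozenset = frozenset({1433, 3389})
    operator_allowed: frozenset = frozenset()  # (ip, port) pairs approved by a human
    max_connections: int = 50
    redirect_smtp: bool = True
    connections_allowed: int = 0

    def decide(self, dst_ip: str, dst_port: int) -> Action:
        # Outgoing mail is trapped rather than delivered.
        if dst_port == SMTP_PORT and self.redirect_smtp:
            return Action.SPAMHOLE
        # Privileged or known-vulnerable ports are dropped unless an operator approved them.
        if (dst_ip, dst_port) not in self.operator_allowed:
            if dst_port < 1024 or dst_port in self.vulnerable_ports:
                return Action.DROP
        # A simple cap on the number of permitted connections.
        if self.connections_allowed >= self.max_connections:
            return Action.DROP
        self.connections_allowed += 1
        return Action.ALLOW


policy = SandboxPolicy(operator_allowed=frozenset({("192.0.2.10", 80)}), max_connections=2)
print(policy.decide("198.51.100.5", 25))   # Action.SPAMHOLE
print(policy.decide("198.51.100.5", 445))  # Action.DROP
print(policy.decide("192.0.2.10", 80))     # Action.ALLOW
```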

We are confident that our research to date has been 
safe. However, the transmission of any network traffic 
poses some degree of risk of causing harm to the receiver, 
particularly when the traffic originates from an untrusted 
binary downloaded from the Internet. In Section 6, we 
present our thoughts on the long-term viability of safely 
conducting this research. 


3.3.2 Experience classifying bots 


We have found that certain bots detect when they are be- 
ing run in a virtual machine and disable themselves. To 
identify VMM detection, Botlab generates two network 
fingerprints for each binary: we execute the binary in a 
VMware virtual machine and also on a bare-metal ma- 
chine containing a fresh Windows installation. By com- 
paring the resulting two network fingerprints, we can in- 
fer whether the binary is performing any VM detection. 
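
The text does not spell out how the two fingerprints are compared. One plausible reading, sketched in Python below, is to flag a binary when most of the flows seen on bare metal are absent in the VM. The function name, the cutoff value, and the flow tuples are assumptions.

```python
from typing import AbstractSet, Hashable


def likely_detects_vm(vm_fingerprint: AbstractSet[Hashable],
                      bare_metal_fingerprint: AbstractSet[Hashable],
                      max_missing_fraction: float = 0.5) -> bool:
    """Flag a binary whose bare-metal flows mostly vanish inside the VM."""
    if not bare_metal_fingerprint:
        return False  # nothing to compare against
    missing = bare_metal_fingerprint - vm_fingerprint
    return len(missing) / len(bare_metal_fingerprint) > max_missing_fraction


# Hypothetical flows observed on the bare-metal machine.
bare = {("tcp", "192.0.2.10", 80), ("tcp", "198.51.100.25", 25)}
print(likely_detects_vm(set(), bare))  # True: the bot went silent in the VM
print(likely_detects_vm(bare, bare))   # False: same behavior in both environments
```
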
Some of the spamming binaries we analyzed made ini- 
tial SMTP connections, but subsequently refused to send 
spam. For example, one spam bot connected to spam- 
hole, but never sent any spam messages after receiving 
the initial greeting string from the SMTP server. We de- 
duced that this bot was checking that the greeting string 
included the domain name to which the bot was connect- 
ing, and we modified spamhole to return appropriate do- 
main names in the string. 

We also observed that some spam bots perform more 
sophisticated SMTP verification before they send spam. 
For example, when the MegaD bot begins executing, it 
transmits a test e-mail to a special MegaD mail server, 
verifying each header it receives during the SMTP hand- 
shake. MegaD’s mail server returns a message ID string 
after sending the message, which the bot then sends to 
its C&C server. The C&C server verifies that the mes- 
sage with this ID was actually delivered to the MegaD 
mail server before giving any further instructions to the 
bot. Accordingly, to generate a signature for MegaD, and 
later, to continuously execute a captured MegaD node, 
the human operator had to indicate to Botlab to deflect 
SMTP messages destined for MegaD’s mail server from 
the spamhole to the live Internet. 

Some bots do not send spam through SMTP, but in- 
stead use HTTP-based web services. For example, a Ru- 
stock variant rotates through valid hotmail.com ac- 
counts to transmit spam. To safely intercept this spam, 
we had to construct infrastructure that spoofs Hotmail’s 
login and mail transmission process, including using fake 
SSL certificates during login. Fortunately, this variant 
does not check the SSL certificates for validity. How- 
ever, if the variant evolves and validates the certificate, 
we would not be able to safely analyze it. 


3.4 Execution Engine 


Botlab executes spamming bot binaries within its execu- 
tion engine. The engine runs each bot within a VM or on 
a dedicated bare-metal box, depending on whether the 
bot binary performs VMM detection. In either case, Bot- 
lab sandboxes network traffic to prevent harm to external 
hosts. We re-use the network safeguards described in the 
previous section in the execution engine sandbox: our 
sandbox redirects outgoing e-mail to spamhole, permits 
only traffic patterns previously identified as safe by a hu- 
man operator to be transmitted to the Internet, and drops 
all other packets. Traffic permitted on the Internet is also 
subject to the same rate limiting policies we previously 
described. 

Though we have analyzed thousands of malware bi- 
naries to date, only a surprisingly small fraction corre- 
spond to unique spamming botnets. In fact, we have 
so far found just seven spamming bots: Grum, Kraken, 
MegaD, Pushdo, Rustock, Srizbi, and Storm. (Botnet 
names are derived according to tags with which anti- 
virus tools classify the corresponding binaries.) We be- 
lieve these are the most prominent spam botnets existing 
today, and our results suggest that they are responsible 
for sending most of the world’s spam. Thus, it appears 
the spam botnet landscape consists of just a handful of 
key players. 


3.4.1 Avoiding blacklisting 


If the botnet owners learn about Botlab’s existence, they 
might attempt to blacklist IP addresses belonging to the 
University of Washington. The C&C servers would then 
refuse connections to Botlab’s captive bots, rendering 
Botlab ineffective. To prevent this, Botlab routes any bot 
traffic permitted onto the Internet, including C&C traffic, 
through the anonymizing Tor network [6]. Our malware 
crawler is also routed through Tor. While Tor provides 
a certain degree of anonymity, a long-term solution to 
avoid blacklisting would be to install monitoring agents 
at geographically diverse and secret locations, with the 
hosting provided by organizations that desire to combat 
the botnet threat. 


Some bots track and report the percentage of e-mail 
messages successfully sent and e-mail addresses for 
which sending failed. These lists can be used by bot- 
net owners to filter out invalid or outdated addresses. To 
avoid detection, we had to ensure that our bots did not re- 
port 100% delivery rates, as these are unlikely to happen 
in the real world. Doing so was easy; our bots experience 
many send errors because of failed DNS lookups for mail 
servers. Thus, we simply rely on DNS to provide us with 
a source of randomness in bot-reported statistics. Should 
bot masters begin to perform a more complicated statis- 
tical analysis, more controlled techniques for introducing 
random failures in spamhole might become necessary. 


3.4.2 Multiple C&C servers 


Some botnets partition their bots across several C&C 
servers. For example, in Srizbi, different C&C servers 
are responsible for sending different classes of spam. 
These spam classes differ in subject line, content, em- 
bedded URLs, and even languages. If we were to run 
only a single Srizbi bot binary, it would connect to one 
C&C server, and therefore we would only have a partial 
view of the overall botnet activity. 


To rectify this, we take advantage of a C&C redun- 
dancy mechanism built into many bots, including Srizbi: 
if the primary C&C server goes down, an alternate C&C 
server is selected either via hardcoded IP addresses or 
programmatic DNS lookups. Botlab can thus block the 
primary C&C server(s) and learn additional C&C ad- 
dresses. Botlab can then run multiple instances of the 
same bot, each routed to a different C&C server. 
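
The block-and-observe idea can be sketched as a short loop. In the Python below, `run_bot` is a hypothetical hook standing in for executing a captive bot in the sandbox with certain addresses blocked; `fake_bot`, the fallback list, and the round limit are likewise assumptions made so the example runs on its own.

```python
from typing import Callable, List, Optional, Set


def discover_cc_servers(run_bot: Callable[[Set[str]], Optional[str]],
                        max_rounds: int = 10) -> List[str]:
    """Repeatedly run a captive bot, blocking every C&C address seen so far.

    run_bot(blocked) is assumed to return the C&C address the bot reached
    with the given addresses blocked, or None if it reached none.
    """
    blocked: Set[str] = set()
    discovered: List[str] = []
    for _ in range(max_rounds):
        server = run_bot(blocked)
        if server is None or server in blocked:
            break
        discovered.append(server)
        blocked.add(server)
    return discovered


# Stand-in for a real bot: tries a fixed fallback list in order.
FALLBACKS = ["192.0.2.1", "192.0.2.2", "192.0.2.3"]


def fake_bot(blocked: Set[str]) -> Optional[str]:
    return next((addr for addr in FALLBACKS if addr not in blocked), None)


print(discover_cc_servers(fake_bot))  # ['192.0.2.1', '192.0.2.2', '192.0.2.3']
```

Each discovered address could then be assigned to a separate bot instance, as the text describes.
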


3.5 Correlating incoming and outgoing spam 


Botlab’s correlation analyzer combines our different 
sources of botnet information to provide a more complete 
view into overall botnet activity. For example, armed 
with a real-time outgoing spam feed, we can classify 
spam received by our incoming spam feed according to 
the botnet that is responsible for sending it. We will de- 
scribe how we derived our classification algorithm and 
evaluate its accuracy in Section 4.3.1. 

For spam that cannot be attributed to a particular bot- 
net using our correlation analysis, we use clustering anal- 
ysis to identify sets of relays used in the same spam cam- 
paign. In Section 4.2, we evaluate various ways in which 
this clustering can be performed. If there is a significant 
overlap between a campaign’s relay cluster and known 
members of a particular botnet (where botnet member- 
ship information is derived from the earlier correlation 
analysis), then we can merge the two sets of relays to 
derive a more complete view of botnet membership. 
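
A minimal Python sketch of this merging step follows. What counts as "significant overlap" is not defined at this point in the text, so the measure used here (the fraction of the cluster's relays already known to belong to the botnet) and the 0.5 cutoff are assumptions, as are the function name and the addresses.

```python
from typing import Set


def merge_if_overlapping(cluster: Set[str], botnet_members: Set[str],
                         min_overlap: float = 0.5) -> Set[str]:
    """Return the botnet membership set, enlarged by the cluster when they overlap enough."""
    if not cluster:
        return set(botnet_members)
    overlap = len(cluster & botnet_members) / len(cluster)
    if overlap >= min_overlap:
        return botnet_members | cluster
    return set(botnet_members)


known = {"192.0.2.1", "192.0.2.2", "192.0.2.3"}                   # from correlation analysis
campaign_cluster = {"192.0.2.2", "192.0.2.3", "198.51.100.9"}     # from clustering
merged = merge_if_overlapping(campaign_cluster, known)
print(len(merged))  # 4: two of the three relays were already known, so the cluster is merged
```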


3.6 Summary 


We have outlined an architecture for Botlab, a real-time 
spam botnet monitoring system. Some elements of Bot- 
lab have been proposed elsewhere; our principal contri- 
bution is to assemble these ideas into an end-to-end sys- 
tem that can safely identify malicious binaries, remove 
duplicates, and execute them without being blacklisted. 
By correlating the activity of captured bots with the ag- 
gregate incoming spam feed, the system has the potential 
to provide more comprehensive information on spam- 
ming botnets and also enable effective defenses against 
them. We discuss these issues in the remainder of the 
paper. 


4 Analysis 


We now present an analysis of botnets that is enabled 
by our monitoring infrastructure. First, we examine the 
actions of the bots being run in Botlab, characterize 
their behavior, and analyze the properties of the outgo- 
ing spam feed they produce. Second, we analyze our 
incoming spam feed to extract coarse-grained, aggregate 
information regarding the perpetrators of malicious ac- 
tivity. Finally, we present analysis that is made possi- 
ble by studying both the outgoing and incoming spam 
feeds. Our study reveals several interesting aspects of 
spamming botnets. 


4.1 The Spam Botnets 


In our analysis, we focus on seven spam botnets: Grum, 
Kraken, MegaD, Pushdo, Rustock, Srizbi, and Storm. 
Although our malware crawler analyzed thousands of 
potential executables, after network fingerprinting and 
pruning described earlier, we found that only variants of 
these seven bots actively send spam. Next, we summa- 
rize various characteristics of these botnets and our ex- 
perience running them. 


4.1.1 Behavioral Characteristics 


Table 1 summarizes various characteristics of our bot- 
nets, which we have monitored during the past six 
months. The second column depicts the number of days 
on which we have observed a botnet actively sending 
spam. We found that keeping all botnets active simul- 
taneously is difficult. First, locating a working binary for 
each botnet required vastly different amounts of time, de- 
pending on the timings of botnet propagation campaigns. 
For example, we have only recently discovered Grum, a 
new spamming botnet which has only been active for 8 
days, whereas Rustock has been running for more than 
5 months. Second, many bots frequently go offline for 
several days, as C&C servers are taken down by law en- 
forcement, forcing the bot herders to re-establish new 
C&C servers. Sometimes this breaks the bot binary, 
causing a period of inactivity until a newer, working ver- 
sion is found. 

The amount of outgoing spam an individual bot can 
generate is vastly different across botnets. MegaD and 
Srizbi bots are the most egregious: they can send out 
more than 1,500 messages per minute, using as many as 
80 parallel connections at a time, and appear to be lim- 
ited only by the client’s bandwidth. On the other hand, 
Rustock and Storm are “polite” to the victim — they send 
messages at a slow and constant rate and are unlikely to 
saturate the victim’s network connection. The large variabil- 
ity in send rates suggests that these rates might be useful in 
fingerprinting and distinguishing various botnets. 

Bots use various methods to locate and communicate 
with their C&C servers. We found that many botnets use 
very simple schemes. Rustock, Srizbi, and Pushdo sim- 
ply hardcode the C&C’s IP address in the bot binary, and 
MegaD hardcodes a DNS name. Kraken uses a propri- 
etary algorithm to generate a sequence of dynamic DNS 
names, which it then attempts to resolve until it finds a 
working name. An attacker registers the C&C server at 
one of these names and can freely move the C&C to an- 
other name in the event of a compromise. In all of these 
cases, Botlab can efficiently pinpoint the IP addresses of 
the active botnet C&C servers; if these servers could be 
efficiently located and shut down, the amount of world- 
wide spam generated would be substantially reduced. 

Although recent analysis suggests that botnet control 
is shifting to complicated decentralized protocols as ex- 
emplified by Storm [20, 33], we found the majority of 
our other spam bots use HTTP to communicate with their 
C&C server. Using HTTP is simple but effective, since 
bot traffic is difficult to filter from legitimate web traffic. 
HTTP also yields a simple pull-based model for botnet 

Botnet   # days active  Total spam  Spam send rate  C&C protocol                                 C&C servers contacted  C&C discovery
         in trace       messages    (messages/min)                                               over lifetime
Grum     [...]          864,316     [...]           encrypted HTTP, port 80                      [...]                  static IP (206.51.231.192)
Kraken   [...]          5,046,803   [...]           encrypted HTTP, port 80                      [...]                  algorithmic DNS lookups
MegaD    [...]          [...]       [...]           encrypted custom protocol, ports 80 and 443  [...]                  static DNS name (majzufaiuq.info)
Pushdo   59 days        4,932,340   [...]           encrypted HTTP, port 80                      [...]                  set of static IPs
Rustock  164 days       7,174,084   [...]           encrypted HTTP, port 80                      [...]                  static IP (208.72.169.54)
Srizbi   [...]          86,003,889  [...]           unencrypted HTTP, port 4099                  [...]                  set of static IPs
Storm    [...]          961,086     [...]           compressed TCP                               [...]                  p2p (Overnet)


Table 1: The botnets monitored in Botlab. Table gives characteristics of representative bots participating in the seven botnets. 
Some bots use all available bandwidth to send more than a thousand messages per minute, while others are rate-limited. Most 
botnets use HTTP for C&C communication. Some do not ever change the C&C server address yet stay functional for a long time. 


operators: a new bot makes an HTTP request for work 
and receives an HTTP response that defines the next task. 
Upon completing the task, the bot makes another request 
to relay statistics, such as valid and invalid destination 
addresses, to the bot master. All of our HTTP bots fol- 
low this pattern, which is easier to use and appears just 
as sustainable as a decentralized C&C protocol such as 
Storm’s protocol. 


We checked whether botnets frequently change their 
C&C server to evade detection or reestablish a com- 
promised server. The column “C&C servers contacted” 
of Table 1 shows how many times a C&C server used 
by a bot was changed. Surprisingly, many bots change 
C&C servers very infrequently; for example, the various 
copies of Rustock and Srizbi bots have used the same 
C&C IP address for 164 and 51 days, respectively, and 
experienced no downtime during these periods. Some 
bots are distributed as a set of binaries, each with differ- 
ent hardcoded C&C information. For example, we found 
20 variants of Srizbi, each using one hardcoded C&C IP 
address. The C&C changes are often confined to a par- 
ticular subnet; the 10 most active /16 subnets contributed 
103 (57%) of all C&C botnet servers we’ve seen. As 
well, although none of the botnets shared a C&C server, 
we found multiple overlaps in the corresponding subnets; 
one subnet (208.72.*.*) provided C&C servers for Srizbi, 
Rustock, Pushdo, and MegaD, suggesting infrastructural 
ties across different botnets. 


As it turns out, these botnets had many of their C&C
servers hosted by McColo, a US-based hosting provider.
On November 11, McColo was taken offline by its ISPs,
and as a result, the amount of spam reaching the University
of Washington dropped by almost 60%. As of February
2009, the amount of spam reaching us has steadily
increased to around 80% of the pre-shutdown levels, as
some of the botnet operators have been able to redirect
their bots to new C&C servers and, in addition, new botnets
have sprung up to replace the old ones.


4.1.2 Outgoing Spam Feeds 


The spam generated by our botnets is a rich source of in- 
formation regarding their malicious activities. The con- 
tent of the spam emails can be used to identify the scams 
perpetrated by the botnets (as discussed in Section 4.3) 
and help develop application-level defenses for end-hosts 
(see Section 5). In this section, we analyze the character- 
istics of the spam mailing lists, discuss the reach of var- 
ious botnets, and examine whether spam subjects could 
be used as fingerprints for the botnets. 


Size of mailing lists: We first use the outgoing spam 
feeds to estimate the size of the botnets’ recipient lists. 
We assume the following model of botnet behavior: 


• A bot periodically obtains a new chunk of recipients
from the master and sends spam to this recipient list.
Let c be the chunk size.

• On each such request, the chunk of recipients is selected
uniformly at random from the spam list.

• The chunk of recipients received by a bot is much
smaller than the spam list size N.


Assuming these are true, the probability of a particular
email address from the spam list appearing in k chunks of
recipients obtained by a bot is $1 - (1 - c/N)^k$. As the
second term decays with k, the spam feed will expose the
entire recipient list in an asymptotic manner, and eventually
most newly-picked addresses will be duplicates of
previous picks. Further, if we recorded the first m recipient
addresses from a spam trace, the expected number of
repetitions of these addresses within the next k chunks is
$m[1 - (1 - c/N)^k]$.

We have observed that MegaD, Rustock, Kraken, and
Storm follow this model. We fit the rates at which they
see duplicates in their recipient lists to the model above
to obtain their approximate spam list sizes. We present
the size estimates at a confidence level of 95%. We estimate
MegaD’s spam list size to be 850 million addresses
(±0.2%), Rustock’s to be 1.2 billion (±3%), Kraken’s
to be 350 million (±0.3%), and Storm’s to be 110 million
(±6%).
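
As a rough illustration of the fit described above, the following sketch inverts the repetition model $m[1 - (1 - c/N)^k]$ for N by bisection. It is only a single-point sketch under the three modeling assumptions listed earlier; the function names and the sample numbers are hypothetical, and a real fit would use the whole duplicate-rate curve and report a confidence interval.

```python
def expected_repeats(m, c, k, n_list):
    """Expected number of the first m recorded addresses seen again in the next k chunks."""
    return m * (1.0 - (1.0 - c / n_list) ** k)


def estimate_list_size(m, c, k, observed_repeats, hi=1e12):
    """Solve expected_repeats(m, c, k, N) == observed_repeats for N by bisection.

    The model decreases monotonically in N, so bisection is applicable.
    The upper bound `hi` is an assumed cap on the list size.
    """
    lo = float(c)  # the spam list is at least as large as one chunk
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if expected_repeats(m, c, k, mid) > observed_repeats:
            lo = mid  # model predicts too many repeats, so the list must be larger
        else:
            hi = mid
    return (lo + hi) / 2.0


# Hypothetical example: synthesize an observation from a known N, then recover it.
true_n = 1_000_000
observed = expected_repeats(m=10_000, c=1_000, k=500, n_list=true_n)
estimate = estimate_list_size(m=10_000, c=1_000, k=500, observed_repeats=observed)
```

The round trip only checks that the inversion is self-consistent; it says nothing about how well real bots follow the model.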

Srizbi and Pushdo partition their spam lists in a way
that precludes the above analysis. We have not yet collected
enough data for Grum to reliably estimate its spam
list size; our bot has not sent enough emails to see duplicate
recipient email addresses.

MegaD Kraken Rustock
Kraken 7%
MegaD 9%
Pushdo 0%
Rustock N/A
Srizbi 21% 10% 8%
Storm 24% 11% 7%

Table 2: Overlap between recipient spam lists. The table 
shows the fraction of each botnet’s recipient list that is shared 
with MegaD, Kraken, and Rustock’s recipient lists. For exam- 
ple, Kraken shares 28% of its recipient list with MegaD. 


Overlap in mailing lists: We also examined whether 
botnets systematically share parts of their spam lists. To 
do this, we have measured address overlap in outgoing 
spam feeds collected thus far and combined it with mod- 
eling similar to that in the previous section (more details 
are available in [16]). We found that overlaps are sur- 
prisingly small: the highest overlap is between Kraken 
and MegaD, which share 28% of their mailing lists. It 
appears different botnets cover different partitions of the 
global email list. Thus, spammers can benefit from using 
multiple botnets to get wider reach, a behavior that we in 
fact observe and discuss in Section 4.3. 


Spam subjects: Botnets carefully design and hand-tune 
custom spam subjects to defeat spam filters and attract at- 
tention. We have found that between any two spam bot- 
nets, there is no overlap in subjects sent within a given 
day, and an average overlap of 0.3% during the length 
of our study. This suggests that subjects are useful for 
classifying spam messages as being sent by a particular 
botnet. To apply subject-based classification, we remove 
any overlapping subjects, leaving, on average, 489 sub- 
jects per botnet on a given day. As well, a small number 
of subjects include usernames or random message IDs. 
We remove these elements and replace them with reg- 
exps using an algorithm similar to AutoRE [39]. We will 
evaluate and validate this classification scheme using our 

Figure 3: Fraction of spam that is captured by using IP-based
blacklists. We find that using relays seen locally so far
works as well as a commercial blacklist, and can block almost
60% of the spam.


incoming spam in Section 4.3.1. 
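
The subject-based signatures described above can be sketched as follows. This is a deliberately simplified stand-in, not AutoRE and not our actual implementation: it assumes that the only variable parts of a subject are runs of digits, it drops any pattern shared by more than one botnet, and all function names are hypothetical.

```python
import re
from collections import Counter


def generalize(subject):
    """Turn a subject into a regex string, replacing each digit run with a digit wildcard."""
    parts = re.split(r"(\d+)", subject)
    # With one capturing group, odd-indexed parts are the captured digit runs.
    return "".join(r"\d+" if i % 2 else re.escape(p) for i, p in enumerate(parts))


def build_signatures(subjects_by_botnet):
    """Map each botnet to compiled patterns, removing patterns seen for several botnets."""
    patterns = {b: {generalize(s) for s in subs} for b, subs in subjects_by_botnet.items()}
    counts = Counter(p for pats in patterns.values() for p in pats)
    return {
        b: [re.compile(p) for p in sorted(pats) if counts[p] == 1]
        for b, pats in patterns.items()
    }


def classify(subject, signatures):
    """Return the botnet whose signature fully matches the subject, or None."""
    for botnet, regexes in signatures.items():
        if any(r.fullmatch(subject) for r in regexes):
            return botnet
    return None


# Hypothetical subjects for two made-up botnets "A" and "B".
signatures = build_signatures({
    "A": ["Order 123 shipped", "Hello"],
    "B": ["Hello", "Win 5 prizes"],
})
```

Because the signatures are rebuilt from the outgoing feed, they would need to be regenerated whenever the bots change their subjects.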


4.2 Analysis of Incoming Spam 


We analyze 46 million spam messages obtained from a 
50-day trace of spam from University of Washington and 
use it to characterize the hosts sending the spam, the 
scam campaigns propagated using spam, and the web 
hosting infrastructure for the scams. To do this, each 
spam message is analyzed to extract the set of relays 
through which the purported sender forwarded the email, 
the subject, the recipient address, other SMTP headers 
present in the email, and the various URLs embedded 
inside the spam body. 

We found that on average, 89.2% of the incoming mail 
at UW is classified as spam by UW’s filtering systems. 
Around 0.5% of spam contains viruses as attachments.
Around 95% of the spam messages contain HTTP links, 
and 1% contain links to executables. 


4.2.1 Spam sources 


Figure 2 plots the total number of distinct last-hop relays 
seen in spam messages over time. We consider only the 
IP of the last relay used before a message reaches UW’s 
mail servers, as senders can spoof other relays. The num- 
ber of distinct relay IPs increases steadily over time and 
reaches 9.5 million after 7 weeks’ worth of spam messages.
Two factors could be responsible for keeping this

Figure 4: Number of messages sourced by distinct relay IPs,
over a single day and the entire trace.


growth linear. One is a constant balance between the in- 
flux of newly-infected bots and the disappearance of dis- 
infected hosts. Another is the use of dynamic IP (DHCP) 
leases for end hosts, which causes the same physical ma- 
chine to manifest itself under different IPs. To place a 
lower bound on the number of distinct spam hosts given 
the DHCP effect, Figure 2 also shows the number of dis- 
tinct /24’s corresponding to spam relays, assuming that 
the IPs assigned by DHCP to a particular host stay in the 
/24 range. 
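
The two curves can be computed with a few lines of code. The sketch below counts distinct last-hop IPs and distinct /24 prefixes; the function name and the sample addresses are hypothetical, and it assumes IPv4 addresses given as dotted-quad strings.

```python
import ipaddress


def distinct_relays_and_slash24s(relay_ips):
    """Return (number of distinct IPs, number of distinct /24 prefixes)."""
    ips = {ipaddress.IPv4Address(ip) for ip in relay_ips}
    # Dropping the low 8 bits of the 32-bit address leaves the /24 prefix.
    prefixes = {int(ip) >> 8 for ip in ips}
    return len(ips), len(prefixes)


counts = distinct_relays_and_slash24s(["10.0.0.1", "10.0.0.2", "10.0.1.1", "10.0.0.1"])
```

The /24 count is a lower bound only under the stated assumption that DHCP keeps a host within one /24.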

The constantly changing list of IPs relaying spam 
does suggest that simple IP-based blacklists, such as the 
Spamhaus blacklist [31], will not be very effective at 
identifying spam. To understand the extent to which
this churn impacts the effectiveness of IP-based black- 
lists, we analyze four strategies for generating blacklists 
and measure their ability to filter spam. First, we consider
a blacklist comprising the IP addresses of the
relays which sent us spam a week ago. Next, we have a 
blacklist that is made up of the IP addresses of the relays 
which sent us spam the previous day. Third, we consider 
a blacklist that contains the IP addresses of the relays 
which sent us spam at any point in the past. Finally, we 
look at a commonly used blacklist such as the Composite 
Blocking List (CBL) [4]. Figure 3 shows the compari- 
son. The first line shows how quickly the effectiveness 
of a blacklist drops with time, with a week-old blacklist 
blocking only 20% of the spam. Using the relay IPs from 
the previous day blocks around 40% of the spam, and us- 
ing the entire week’s relay IPs can decrease the volume 
of spam by 50-60%. Finally, we see that a commercial
blacklist performs roughly as well as the local blacklist
that uses a week’s worth of information. We view
these as preliminary results since a rigorous evaluation 
of the effectiveness of blacklists is possible only if we 
can also quantify the false positive rates. We defer such 
analysis to future work. 
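
The three locally generated blacklists can be compared with a sketch like the one below. It assumes a hypothetical dictionary `relays_by_day` that maps a day index to a list holding one last-hop relay IP per spam message received that day; it measures only the fraction of spam blocked and, like our analysis, says nothing about false positives.

```python
def blocked_fraction(messages_today, blacklist):
    """Fraction of today's spam whose last-hop relay IP is on the blacklist."""
    if not messages_today:
        return 0.0
    hits = sum(1 for ip in messages_today if ip in blacklist)
    return hits / len(messages_today)


def evaluate_local_blacklists(relays_by_day, day):
    """Compare week-old, previous-day, and all-past blacklists on one day's spam."""
    today = relays_by_day[day]
    week_old = set(relays_by_day.get(day - 7, []))
    previous_day = set(relays_by_day.get(day - 1, []))
    all_past = {ip for d, ips in relays_by_day.items() if d < day for ip in ips}
    return {
        "week_old": blocked_fraction(today, week_old),
        "previous_day": blocked_fraction(today, previous_day),
        "all_past": blocked_fraction(today, all_past),
    }


# Hypothetical toy trace: day index -> relay IP of each message.
result = evaluate_local_blacklists({1: ["a", "b"], 7: ["c"], 8: ["a", "c", "d", "d"]}, day=8)
```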

We next analyze the distribution of the number of mes- 
sages sent by each spam relay. Figure 4 graphs the num- 
ber of messages each distinct relay has sent during our 
trace. We also show the number of messages sent by 
each relay on a particular day, where DHCP effects are 


Figure 5: Number of distinct hostnames in URLs conveyed
by spam. Spammers constantly register new DNS names.
[Plot axes: x, number of hours since start; y, number of hostnames.]

Figure 6: Clustering spam messages by the IP of URLs contained
within them. Links in 80% of spam point to only 15
distinct IP clusters.


less likely to be manifested. On any given day, only a 
few tens of relays send more than 1,000 spam messages, 
with the bulk of the spam conveyed by the long tail. In 
fact, the relays that sent over 100 messages account for 
only 10% of the spam, and the median number of spam 
messages per relay is 6. One could classify the heavy 
hitters as either well-known open mail relays or heav- 
ily provisioned spam pumps operated by miscreants. We 
conjecture that most of the long tail corresponds to com- 
promised machines running various kinds of bots. 
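
The per-relay statistics quoted above can be derived from a list of last-hop IPs, one entry per message. The following is a minimal sketch with hypothetical names; it assumes a non-empty trace.

```python
from collections import Counter
from statistics import median


def relay_volume_stats(last_hop_ips, heavy_threshold=100):
    """Median messages per relay and the share of spam sent by heavy hitters."""
    per_relay = Counter(last_hop_ips)
    counts = list(per_relay.values())
    heavy = sum(n for n in counts if n > heavy_threshold)
    return {
        "median_per_relay": median(counts),
        "heavy_share": heavy / len(last_hop_ips),
    }


# Hypothetical toy trace with a lowered threshold so one relay counts as heavy.
stats = relay_volume_stats(["a"] * 5 + ["b"] * 2 + ["c"] + ["d"] * 2, heavy_threshold=4)
```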


4.2.2 Spam campaigns and Web hosting 


We next examine whether we can identify and charac- 
terize individual spam campaigns based on our incoming 
spam. Ideally, we would cluster messages based on sim- 
ilar content; however, this is difficult as spammers use 
sophisticated content obfuscation to evade spam detec- 
tion. Fortunately, more than 95% of spam in our feed 
contains links. We thus cluster spam based on the fol- 
lowing attributes: 1) the domain names appearing in the 
URLs found in spam, 2) the content of Web pages linked 
to by the URLs, and 3) the resolved IP addresses of the 
machines hosting this content. We find that the second 
attribute is the most useful for characterizing campaigns. 

Clustering with URL domain names revealed that for 
any particular day, 10% of the observed domain names 
account for 90% of the spam. By plotting the num- 
ber of distinct domain names observed in our spam feed 
over time (shown in Figure 5), we found that the number
of distinct hostnames is large and increases steadily, as 
spammers typically use newly-registered domains. (In 
fact, on average, domain names appearing in our spam 
are only two weeks old based on whois data.) Conse- 
quently, domain-based clustering is too fine-grained to 
reveal the true extent of botnet infections. 

Our content clustering is performed by fetching the 
Web page content of all links seen in our incoming spam. 
We found that nearly 80% of spam pointed to just 11 dis- 
tinct Web pages, and the content of these pages did not 
change during our study. We conclude that while spam- 
mers try to obfuscate the content of messages they send 
out, the Web pages being advertised are static. Although 
this clustering can identify distinct campaigns, it cannot 
accurately attribute them to specific botnets. We revisit 
this clustering method in Section 4.3.2, where we add 
information about our botnets’ outgoing spam. 

For IP-based clustering, we analyzed spam messages 
collected during the last week of our trace. We ex- 
tracted hostnames from all spam URLs and performed 
DNS lookups on them. We then collected sets of re- 
solved IPs from each lookup, merging any sets sharing 
a common IP. Finally, we grouped spam messages based 
on these IP clusters; Figure 6 shows the result. We found 
that 80% of the spam corresponds to the top 15 IP clus- 
ters (containing a total of 57 IPs). In some cases, the 
same Web server varied content based on the domain 
name that was used to access it. For example, a single 
server in Korea hosted 20 different portals, with demul- 
tiplexing performed using the domain name. We conjec- 
ture that such Web hosting services are simultaneously 
supporting a number of different spam campaigns. As a 
consequence, web-server-based clustering is too coarse- 
grained to disambiguate individual botnets. 
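
The merging step of the IP-based clustering (combine any resolved-IP sets that share an address) is a connected-components problem. The sketch below uses a union-find structure; it assumes the DNS lookups have already been done and are supplied as a hypothetical mapping from hostname to its set of resolved IPs.

```python
def cluster_hostnames_by_ip(resolved):
    """Group hostnames whose resolved IP sets are linked by shared addresses.

    `resolved` maps hostname -> set of IP strings. Returns a list of
    (hostnames, ips) pairs, one per cluster.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb

    # All IPs returned by one lookup belong to the same cluster.
    for ips in resolved.values():
        ips = list(ips)
        for ip in ips:
            find(ip)
        for ip in ips[1:]:
            union(ips[0], ip)

    clusters = {}
    for host, ips in resolved.items():
        if not ips:
            continue  # unresolved hostnames are skipped
        root = find(next(iter(ips)))
        hosts, members = clusters.setdefault(root, (set(), set()))
        hosts.add(host)
        members.update(ips)
    return list(clusters.values())


# Hypothetical lookups: the first two hostnames share 2.2.2.2.
clusters = cluster_hostnames_by_ip({
    "a.example": {"1.1.1.1", "2.2.2.2"},
    "b.example": {"2.2.2.2", "3.3.3.3"},
    "c.example": {"9.9.9.9"},
})
```

Spam messages can then be assigned to the cluster of any hostname appearing in their URLs.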


4.3 Correlation Analyses 


We now bring together two of our data sources, our out- 
going and incoming spam feeds, and perform various 
kinds of correlation analyses, including: 1) classifying 
spam according to which botnet sourced it, 2) identify- 
ing spam campaigns and analyzing botnet partitioning, 
3) classifying and analyzing spam used for recruiting 
new victims, and 4) estimating botnet sizes and produc- 
ing botnet membership lists. Note that we exclude Grum 
from these analyses, because we have not yet monitored 
this recently discovered bot for a sufficiently long time. 


4.3.1 Spam classification 


To classify each spam message received by University 
of Washington as coming from a particular botnet, we 
use subject-based signatures we derived in Section 4.1.2. 
Each signature is dynamic — it changes whenever bot- 
nets change their outgoing spam. We have applied these 

Figure 7: Average contributions of each botnet to incoming
spam received at University of Washington. 79% of spam
comes from six spam botnets, and 35% comes from just one
botnet, Srizbi.

Figure 8: Breakdown of spam e-mails by botnet over time.
Most botnets contribute approximately the same fraction of
spam to our feed over our study period, with Srizbi, Rustock,
and MegaD being the top contributors. Kraken shows gaps in
activity on days 28-32 and 52. Day 1 corresponds to March 13,
2008.


signatures to a 50-day trace of incoming spam messages 
received at University of Washington in March and April 
2008. Figure 7 shows how much each botnet contributed 
to UW spam on average, and Figure 8 shows how the 
breakdown behaves over time. We find that on average, 
our six botnets were responsible for 79% of UW’s in- 
coming spam. This is a key observation: it appears that 
for spam botnets, only a handful of major botnets pro- 
duce most of today’s spam. In fact, 35% of all spam is 
produced by just one botnet, Srizbi. This result might 
seem to contradict the lower bound provided by Xie et 
al. [39], who estimated that 16-18% of the spam in
their dataset came from botnets. However, their dataset 
excludes spam sent from blacklisted IPs, and a large frac- 
tion of botnet IPs are present in various blacklists (as 
shown in Section 4.2 and in [28]).


We took a few steps to verify our correlation. First, 
we devised an alternate classification based on certain 
unique characteristics of the “Message ID” SMTP header 
for Srizbi and MegaD bots, and verified that the classifi- 
cation does not change using that scheme. Second, we 
extracted last-hop relays from each classified message 
and checked overlaps between sets of botnet relays. The 

Table 3: Clustering incoming spam by the title of the web
page pointed to by spam URLs. The columns show how frequently
each botnet was sending each campaign on April 30,
2008. Many botnets carry out multiple campaigns simultaneously.


overlaps are small; it is never the case that many of the re- 
lays belonging to botnet X are also in the set of relays for 
botnet Y. The biggest overlap was 3.3% between Kraken 
and MegaD, which we interpret as 3.3% of Kraken’s re- 
lays also being infected with the MegaD bot. 
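
The relay-overlap check can be expressed directly with set operations. The sketch below is illustrative only: the function names and sample data are hypothetical, and overlap is measured as the fraction of botnet X's relays that also appear among botnet Y's relays, which is how we read the 3.3% figure above.

```python
def relay_overlap(relays_x, relays_y):
    """Fraction of botnet X's relays that also appear among botnet Y's relays."""
    x, y = set(relays_x), set(relays_y)
    return len(x & y) / len(x) if x else 0.0


def max_pairwise_overlap(relays_by_botnet):
    """Return (overlap, X, Y) for the ordered pair of botnets with the largest overlap."""
    best = (0.0, None, None)
    for x, rx in relays_by_botnet.items():
        for y, ry in relays_by_botnet.items():
            if x != y:
                best = max(best, (relay_overlap(rx, ry), x, y), key=lambda t: t[0])
    return best


# Hypothetical relay sets for three made-up botnets.
best = max_pairwise_overlap({"X": ["1", "2", "3", "4"], "Y": ["4", "5"], "Z": ["6"]})
```

Note that the measure is asymmetric: a small botnet can overlap heavily with a large one without the reverse being true.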


4.3.2 Spam campaigns 


To gain insight into the kinds of information spammers disseminate,
we classified our incoming spam according to
spam campaigns. We differentiate each campaign by the 
contents of the web pages pointed to by links in spam 
messages. Using data from Section 4.3.1, we classify 
our incoming spam according to botnets, and then break 
down each botnet’s messages into campaign topics, de- 
fined by titles of campaign web pages. Table 3 shows 
these results for a single day of our trace. For example,
Rustock participated in two campaigns: 22% of its
messages advertised “Canadian Healthcare”, while 78%
advertised “MaxGain+”. We could not fetch some links
because of failed DNS resolution or inaccessible web
servers; we marked these as “Unavailable”. The table
only shows the most prevalent campaigns; a group of less
common campaigns is shown in the row marked “Other”.

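
A per-botnet campaign breakdown of this kind can be sketched as a simple tally. The code below assumes each classified message has already been reduced to a hypothetical (botnet, page title) pair, with an empty title standing for a link that could not be fetched; it does not group rare campaigns into an “Other” row.

```python
from collections import Counter, defaultdict


def campaign_breakdown(classified_messages):
    """Return, for each botnet, the percentage of its messages per campaign title."""
    per_botnet = defaultdict(Counter)
    for botnet, title in classified_messages:
        per_botnet[botnet][title if title else "Unavailable"] += 1
    return {
        botnet: {title: 100.0 * n / sum(titles.values()) for title, n in titles.items()}
        for botnet, titles in per_botnet.items()
    }


# Toy input that reproduces the Rustock split quoted in the text.
messages = [("Rustock", "Canadian Healthcare")] * 22 + [("Rustock", "MaxGain+")] * 78
breakdown = campaign_breakdown(messages)
```
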
All of our botnets simultaneously participate in mul- 
tiple campaigns. For example, Kraken and Pushdo par- 
ticipate in at least 5 and 7, respectively. The percent- 
ages give insight into how the botnet divides its bots 
across various campaigns. For example, Kraken might 
have four customers who each pay to use approximately 
20% of the botnet to send spam for “Canadian Pharmacy”,
“Diamond Watches”, “Freedom from Debt”, and
“VPXL”. Multiple botnets often simultaneously participate
in a single campaign, contrary to an assumption
made by prior research [40]. For example, “Canadian 

Table 4: Overlap in hosting infrastructure of the web pages
pointed to by spam URLs. The table shows what fraction of
spam sent by different botnets on April 30, 2008 contain URLs
pointing to the same webservers.

Figure 9: Propagation campaigns. The graph shows the number
of e-mails with links that infected victims with either Srizbi,
Storm, or Pushdo.


Pharmacy” is distributed by Kraken, MegaD, Pushdo, 
Srizbi, and Storm. This suggests the most prominent 
spammers utilize the services of multiple botnets. 


Botnets use different methods to assign their bots to 
campaigns. For example, Botlab monitors 20 variants 
of Srizbi, each using a distinct C&C server. Each C&C 
server manages a set of campaigns, but these sets often 
differ across C&C servers. For example, bots using C&C 
server X and Y might send out “Canadian Pharmacy” 
(with messages in different languages), whereas server Z 
divides bots across “Prestige Replicas” and “Diamond 
Watches”. Thus, Srizbi bots are partitioned statically 
across 20 C&C servers, and then dynamically within 
each server. In contrast, all of our variants of Rustock 
contact the same C&C server, which dynamically sched- 
ules bots to cover each Rustock campaign with a certain 
fraction of the botnet’s overall processing power, behav- 
ing much like a lottery scheduler [35]. 


In Section 4.2.2, we saw that most of the Web servers 
support multiple spam campaigns. Now, we examine 
whether the Web hosting is tied to particular botnets, i.e.,
whether all the spam campaigns hosted on a server are 
sent by the same botnet. In Table 4, we see that this is 
not the case — every pair of botnets shares some hosting 
infrastructure. This suggests that scam hosting is more 
of a 3rd-party service that is used by multiple (potentially
competing) botnets. 

Botnet                Kraken  MegaD   Pushdo  Rustock  Srizbi   Storm
# unique relays seen  20,275  57,402  27,266  83,836   119,604  7,814

Table 5: The number of unique relays belonging to each bot- 
net. These numbers provide a lower bound on the size of each 
botnet, as seen on September 3, 2008. More accurate estimates 
are possible by accounting for relays not seen in our spam feed. 


4.3.3 Recruiting campaigns 


Using our correlation tools, we were able to identify in- 
coming spam messages containing links to executables 
infecting victims with the Storm, Pushdo, or Srizbi bot. 
Figure 9 shows this activity over our incoming spam 
trace. The peaks represent campaigns launched by bot- 
nets to recruit new victims. We have observed two such 
campaigns for Storm — one for March 13-16 and another 
centered on April 1, corresponding to Storm launching 
an April Fool’s day campaign, which received wide cov- 
erage in the news [23]. Srizbi appears to have a steady 
ongoing recruiting campaign, with peaks around April 
15-20, 2008. Pushdo infects its victims in bursts, with 
a new set of recruiting messages being sent out once a 
week. 

We expected the spikes to translate to an increase in 
number of messages sent by either Srizbi, Storm, or 
Pushdo, but surprisingly this was not the case, as seen 
by matching Figures 9 and 8. This suggests that bot 
operators do not assign all available bots to send spam 
at maximum possible rates, but rather limit the overall 
spam volume sent out by the whole botnet. 


4.3.4 Botnet membership lists and sizes 


A botnet’s power and prowess is frequently measured by 
the botnet’s size, which we define as the number of ac- 
tive hosts under the botnet’s control. A list of individual 
nodes comprising a botnet is also useful for notifying and 
disinfecting the victims. We next show how Botlab can 
be used to obtain information on both botnet size and 
membership. 

As before, we classify our incoming spam into sourc- 
ing botnets and extract the last-hop relays from all suc- 
cessfully classified messages. After removing dupli- 
cates, these relay lists identify hosts belonging to each 
botnet. Table 5 shows the number of unique, classified 
relays for a particular day of our trace. Since botnet 
membership is highly dynamic, we perform our calcu- 
lations for a single day, where churn can be assumed to 
be negligible. As well, we assume DHCP does not affect 
the set of unique relays on a timescale of just a single 
day. These numbers of relays for each botnet are effec- 
tively the lower bound on the botnet sizes. The actual 
botnet sizes are higher, since there are relays that did not 
send spam to University of Washington, and thus were
not seen in our trace. We next estimate the percentage of
total relays that we do see, and use it to better estimate 
botnet sizes. 


Let us assume again that a bot sends spam to email addresses
chosen at random. Further, let p be the probability
with which a spam message with a randomly chosen
email address is received by our spam monitoring system
at University of Washington. If n is the number of
messages that a bot sends out per day, then the probability
that at least one of the messages generated by the bot
is received by our spam monitors is $1 - (1 - p)^n$. For
large values of n, such as when $n \sim 1/p$, the probability
of seeing one of the bot’s messages can be expressed as
$1 - e^{-np}$.


We derive p using the number of spam messages re- 
ceived by our spam monitor and an estimate of the global 
number of spam messages. With our current setup, 
the former is approximately 2.4 million daily messages, 
while various sources estimate the latter at 100-120 bil- 
lion messages (we use 110 billion) [14, 21, 32]. This 
gives $p = 2.4 \times 10^{6} / (110 \times 10^{9}) \approx 2.2 \times 10^{-5}$.


For the next step, we will describe the logic using one
of the botnets, Rustock, and later generalize to other botnets.
From Section 4.1, we know that Rustock sends
spam messages at a constant rate of 47.5K messages per
day and that this rate is independent of the access link
capacity of the host. Note that Rustock’s sending rate translates
to a modest rate of about 1 spam message per two seconds,
or about 0.35 KB/s given that the average Rustock message
size is 700 bytes, a rate that can be supported by
almost all end-hosts [15]. The probability that we see the
IP of a Rustock spamming bot node in our spam monitors
on a given day is $1 - e^{-47500 \cdot 2.2 \cdot 10^{-5}} \approx 0.65$. This
implies that the 83,836 Rustock IPs we saw on September
3rd represent about 65% of all Rustock’s relays; thus,
the total number of active Rustock bots on that day was
about 83,836/0.65 ≈ 128,978. Similarly, we estimate
the active botnet size of Storm to be 16,750. We note
that these estimates conservatively assume
a bot stays active 24 hours per day. Because some bots
are powered off during the night, the actual botnet sizes are
likely to be higher.
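
The Rustock calculation can be reproduced with the short sketch below. The function names are hypothetical; the inputs are the figures quoted in the text (2.4 million monitored messages per day, an assumed 110 billion global messages per day, 47,500 messages per bot per day as used in the exponent above, and 83,836 observed relays). Because the sketch does not round p or the detection probability, it gives a probability of roughly 0.645 and a size of roughly 130,000, slightly above the rounded figure in the text.

```python
import math


def detection_probability(msgs_per_day, p):
    """Probability that at least one of a bot's daily messages reaches the monitor."""
    return 1.0 - math.exp(-msgs_per_day * p)


def estimate_botnet_size(relays_seen, msgs_per_day, p):
    """Scale the number of observed relays by the chance of observing any one bot."""
    return relays_seen / detection_probability(msgs_per_day, p)


p = 2.4e6 / 110e9  # monitored daily spam / estimated global daily spam
prob = detection_probability(47_500, p)
size = estimate_botnet_size(83_836, 47_500, p)
```

The estimate is sensitive to the assumed global spam volume: using 100 or 120 billion messages instead of 110 billion shifts p, and therefore the size, by several percent.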


These estimates rely on the fact that both Rustock and 
Storm send messages at a slow, constant rate that is un- 
likely to saturate most clients’ bandwidth. Our other bots 
send spam at higher rates, with the bot adapting to the 
host’s available bandwidth. Although this makes it more 
likely that a spamming relay is detected in our incoming 
spam, it is also more difficult to estimate the number of 
messages a given bot sends. In future work, we plan to 
study the rate adaptation behavior of these bots and com- 
bine it with known bandwidth profiles [15]. Meanwhile, 
Table 5 gives conservative size estimates. 


5 Applications enabled by Botlab


Botlab provides various kinds of real-time botnet infor- 
mation, which can be used by end hosts wishing for pro- 
tection against botnets, or by ISPs and activists for law 
enforcement. Next, we discuss applications enabled by 
our monitoring infrastructure. 


5.1 Safer web browsing 


Spam botnets propagate many harmful links, such as 
links to phishing sites or to web pages installing mal- 
ware. For example, on September 24, 2008, we observed 
the Srizbi botnet distribute 40,270 distinct links to pages 
exploiting Flash to install the Srizbi bot. Although the 
current spam filtering tools are expected to filter out spam 
messages containing these links, we found that this is 
often not the case. For example, we have forwarded a 
representative sample of each of Srizbi’s outgoing spam 
campaigns to a newly-created Gmail account controlled 
by us, where we have used Gmail’s default spam filter- 
ing rules, and found that 79% of spam was not filtered 
out. Worse, Gmail filters are not “improving” quickly 
enough, as forwarding the same e-mails two days later 
resulted in only a 5% improvement in detection. Users 
are thus exposed to many messages containing danger- 
ous links and social engineering traps enticing users to 
click on them. 


Botlab can protect users from such attacks using its 
real-time database of malicious links seen in outgoing, 
botnet-generated spam. For example, we have developed 
a Firefox extension, which checks the links a user visits 
against this database before navigating to them. In this 
way, the extension easily prevented users from browsing 
to any of the malicious links Srizbi sent on September 
24. 
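
The lookup performed before navigation can be sketched as a set-membership test on normalized URLs. The code below is a Python illustration of that logic only, not the code of our Firefox extension; the class name, the normalization rules, and the example URLs are assumptions made for the sketch.

```python
from urllib.parse import urlsplit


def normalize(url):
    """Reduce a URL to lowercase host + path (+ query) so trivial variants compare equal."""
    parts = urlsplit(url.strip())
    host = (parts.hostname or "").lower()
    path = parts.path or "/"
    return host + path + ("?" + parts.query if parts.query else "")


class LinkBlacklist:
    """In-memory stand-in for a real-time database of links seen in outgoing spam."""

    def __init__(self):
        self._links = set()

    def add(self, url):
        self._links.add(normalize(url))

    def is_malicious(self, url):
        return normalize(url) in self._links


blacklist = LinkBlacklist()
blacklist.add("HTTP://Evil.example/flash.swf")  # hypothetical link from an outgoing feed
```

A deployed version would query a shared, continuously updated database rather than a local set.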


Some existing defenses use blacklisting to prevent 
browsers from following malicious links. We have 
checked two such blacklists, the Google Safe Browsing
API and the Malware Domain List, six days after the 
links were sent out, and found that none of the 40,270 
links appeared on either list. These lists suffer from the 
same problem: they are reactive, as they rely on crawl- 
ing and user reports to find malicious links after they are 
disseminated. These methods fail to quickly and exhaus- 
tively find “zero-day” botnet links, which point to mal- 
ware hosted on recently compromised web servers, as 
well as malware hosted on individual bots via fast-flux 
DNS and a continuous flow of freshly-registered domain 
names. In contrast, Botlab can keep up with spam bot- 
nets because it uses real-time blacklists, which are up- 
dated with links at the instant they are disseminated by 
botnets. 


5.2 Spam Filtering 


Spam continuously pollutes email inboxes of many mil- 
lions of Internet users. Most email users use spam fil- 
tering software such as SpamAssassin [30], which uses 
heuristics-based filters to determine whether a given 
message is spam. The filters usually have a threshold 
that a user varies to catch most spam while minimizing 
the number of false positives — legitimate email mes- 
sages misclassified as spam. Often, this still leaves some 
spam sneaking through. 

Botlab’s real-time information can be used to build 
better spam filters. Specifically, using Botlab, we can 
determine whether a message is spam by checking it 
against the outgoing spam feeds for the botnets we mon- 
itor. This is a powerful mechanism: we simply rely on 
botnets themselves to tell us which messages are spam. 

We implemented this idea in an extension for the 
Thunderbird email client. The extension checks mes- 
sages arriving to the user’s inbox against Botlab’s live 
feeds using a simple, proof-of-concept correlation algo- 
rithm: an incoming message comes from a botnet if 1) 
there is an exact match on the set of URLs contained 
in the message body, or 2) if the message headers are 
in a format specific to that used by a particular botnet. 
For example, all of Srizbi’s messages follow the same 
unique message ID and date format, distinct from all 
other legitimate and spam email. Although the second 
check is prone to future circumvention, this algorithm 
gives us an opportunity to pre-evaluate the potential of 
this technique. Recent work has proposed more robust 
algorithms, such as automatic regular-expression gener- 
ation for spammed URLs in AutoRE [39], and we envi- 
sion adopting these algorithms to use Botlab data to filter 
spam more effectively in real-time settings. 
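
The two checks described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `BotnetFeed` structure, the simple URL pattern, and the header regular expression are hypothetical placeholders, not Botlab's actual feed format or any real botnet's header signature.

```python
import re
from dataclasses import dataclass, field
from typing import Optional

# Deliberately simple URL pattern, for illustration only.
URL_RE = re.compile(r"https?://[^\s<>]+")


@dataclass
class BotnetFeed:
    """Hypothetical view of one monitored botnet's outgoing spam feed."""
    name: str
    # Each element is a frozenset of the URLs seen together in one spam message.
    url_sets: set = field(default_factory=set)
    # Assumed header fingerprint for this botnet (None if unknown).
    header_re: Optional[re.Pattern] = None


def extract_urls(body: str) -> frozenset:
    """Collect the set of URLs that appear in a message body."""
    return frozenset(URL_RE.findall(body))


def classify(headers: str, body: str, feeds: list) -> Optional[str]:
    """Return the name of the matching botnet, or None if neither check fires."""
    urls = extract_urls(body)
    for feed in feeds:
        # Check 1: exact match on the set of URLs contained in the body.
        if urls and urls in feed.url_sets:
            return feed.name
        # Check 2: headers follow a format specific to this botnet.
        if feed.header_re is not None and feed.header_re.search(headers):
            return feed.name
    return None
```

Matching on the whole URL set rather than on individual URLs keeps the first check strict, which is one plausible way to keep false positives low in a proof of concept.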

Although we have not yet thoroughly evaluated our ex- 
tension, we performed a short experiment to estimate its 
effectiveness. One author used the extension for a week, 
and found that it reduced the amount of spam bypassing 
his departmental SpamAssassin filters by 156 messages, 
or 76%, while having a 0% false positive rate. Thus, we 
believe that Botlab can indeed significantly improve to- 
day’s spam filtering tools. 


5.3 Availability of Botlab Data


To make Botlab’s data publicly available, we have set up 
a web page, http://botlab.cs.washington.edu/,
which publishes data and statistics we obtained from 
Botlab. The information we provide currently includes 
activity reports for each spam botnet we monitor, ongo- 
ing scams, and a database of rogue links disseminated in 
spam. We also publish lists of current C&C addresses 
and members of individual botnets. We hope this infor- 
mation will further aid security researchers and activists 
in the continuing fight against the botnet threat.


6 Safety


We have implemented safeguards to ensure that Botlab 
never harms remote hosts, networks, or users. In this 
section, we discuss the impact of these safeguards on the 
effectiveness of Botlab, and our concerns over the long- 
term viability of safely conducting bot research. 


6.1 Impact on effectiveness 


Initially, we hoped to construct a fully automatic plat- 
form that required no manual analysis on the part of an 
operator to find and analyze new botnet binaries. We 
quickly concluded this would be infeasible to do safely, 
as human judgment and analysis is needed to deter- 
mine whether previously uncharacterized traffic is safe 
to transmit. 

Even with a human in the loop, safety concerns caused 
us to make choices that limit the effectiveness of our re- 
search. Our network sandbox mechanisms likely pre- 
vented some binaries from successfully communicating 
with C&C servers and activating, causing us to fail to 
recognize some binaries as spambots, and therefore to 
underestimate the diversity and extent of spamming bot- 
nets. Similarly, it is possible that some of our captive 
bot nodes periodically perform an end-to-end check of e- 
mail reachability, and that our spamhole blocking mech- 
anism causes these nodes to disable themselves or behave 
differently than they would in the wild. 


6.2 The long-term viability of safe botnet research


The only provably safe way for Botlab to execute un- 
trusted code is to block all network traffic, but this would 
render Botlab ineffective. To date, our safeguards have 
let us analyze bot binaries while being confident that 
we have not caused harm. However, botnet trends and 
thought experiments have diminished our confidence that 
we can continue to conduct our research safely. 

Botnets are trending towards the use of proprietary 
encrypted protocols to defeat analysis, polymorphism 
to evade detection, and automatically upgrading to new 
variants to incorporate new mechanisms. It is hard to un- 
derstand the impact of allowing an encrypted packet to be 
transmitted, or to ensure that traffic patterns that were be- 
nign do not become harmful after a binary evolves. Ac- 
cordingly, the risk of letting any network traffic out of a 
captured bot node seems to be growing. 

Simple thought experiments show that it is possible for 
an adversary to construct a bot binary for which there is 
no safe and effective Botlab sandboxing policy. As an 
extreme example, consider a hypothetical botnet whose 
C&C protocol consists of different attack packets. If a 
message is sent to an existing member of the botnet, the 
message will be intercepted and interpreted by the bot. 
However, if a message is sent to a non-botnet host, the
message could exploit a vulnerability on that host. If
such a protocol were adopted, Botlab could not trans- 
mit any messages safely, since Botlab would not know 
whether a destination IP address is an existing bot node. 
Other adversarial strategies are possible, such as embed- 
ding a time bomb within a bot node, or causing a bot 
node to send benign traffic that, when aggregated across 
thousands of nodes, results in a DDoS attack. Moreover, 
even transmitting a “benign” C&C message could cause 
other, non-Botlab bot nodes to transmit harmful traffic. 

Given these concerns, we have disabled the crawling 
and network fingerprinting aspects of Botlab, and there- 
fore are no longer analyzing or incorporating new bina- 
ries. As well, the only network traffic we are letting out 
of our existing botnet binaries are packets destined for 
the current, single C&C server IP address associated with 
each binary. Since Storm communicates with many peers 
over random ports, we have stopped analyzing Storm. 
Furthermore, once the C&C servers for the other botnets 
move, we will no longer allow outgoing network packets 
from their bot binaries. Consequently, the Botlab web 
site will no longer be updated with bots that we have to 
disable. It will, however, still provide access to all the 
data we have collected so far. 
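
The restricted egress policy described above amounts to a default-deny allow-list with one entry per binary. A minimal sketch, assuming a hypothetical per-bot table (`ALLOWED_CC`) and documentation-range IP addresses; it does not reflect how Botlab's sandbox is actually implemented:

```python
import ipaddress

# Hypothetical mapping from a captive bot's identifier to the single
# C&C address it may still contact (addresses are documentation ranges).
ALLOWED_CC = {
    "bot-a": ipaddress.ip_address("192.0.2.10"),
    "bot-b": ipaddress.ip_address("198.51.100.7"),
}


def may_transmit(bot_id: str, dst_ip: str) -> bool:
    """Default-deny: pass a packet only if it targets the bot's known C&C server."""
    allowed = ALLOWED_CC.get(bot_id)
    return allowed is not None and ipaddress.ip_address(dst_ip) == allowed


def retire(bot_id: str) -> None:
    """Remove the entry once the C&C server moves; all traffic is then blocked."""
    ALLOWED_CC.pop(bot_id, None)
```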

Our future research must therefore focus on deriving 
analysis techniques that do not require bot nodes to inter- 
act with Internet hosts, and determining if it is possible 
to construct additional safeguards that will sufficiently 
increase our confidence in the safety of transmitting spe- 
cific packets. Unfortunately, our instinct is that a moti- 
vated adversary can make it impossible to conduct effec- 
tive botnet research in a safe manner. 


7 Related Work 


Most related work can be classified into four categories: 
malware collection, malware analysis, botnet tracking 
systems, and spam measurement studies. We now 
discuss how our work relates to representative efforts in 
each of these categories. 


Malware collection: Honeypots (such as Honeynet [13] 
and Potemkin [34]) have been a rich source of new 
malware samples. However, we found them less relevant 
for our work, as they failed to find any spam bots. The 
likely cause is that spam botnets have shifted to social- 
engineering-based propagation, relying less on service 
exploits and self-propagating worms. Other projects, 
such as HoneyMonkey [38], have used automated web 
crawling to discover and analyze malware; automated 
web patrol is now part of Google’s infrastructure [24]. 
However, our results show that Google’s database did 
not contain many malicious links seen in our outgoing 
spam feed, indicating that a blind crawl will not find 
malware from spam-distributed links.


Malware analysis: Botlab does not perform any
static analyses of malware binaries. Instead, it generates 
network fingerprints by executing the binaries and 
observing network accesses. [2] and [37] perform 
similar dynamic analysis of binaries by executing them 
in virtual environments and tracking changes in the 
system, such as the creation of files, processes, registry 
entries, etc. [2] uses this information to group malware 
into broad categories, and [37] generates a detailed 
report of the malware’s actions. Since these techniques 
require running the binary in an instrumented setting 
(such as a debugger), they would not be able to analyze 
malware which performs VMM or debugger detection. 
More similar to our approach is [27], which generates 
network fingerprints and uses them to detect IRC bots. 


Botnet tracking: Closely related to our work is 
the use of virtualized execution environments to track 
IRC botnets [27, 41]. By executing a large number 
of IRC bot samples, these efforts first identify the 
IRC servers and then infiltrate the corresponding IRC 
channels to snoop on the botnets. In our experience, 
botnets are moving away from plaintext IRC protocols to
encrypted HTTP-based or p2p protocols, requiring more 
elaborate mechanisms as well as human involvement 
for a successful infiltration — a point of view that is 
increasingly voiced in the research community [18]. For
example, Ramachandran et al. [28] infiltrated the 
Bobax botnet by hijacking the authoritative DNS for the 
domain running the C&C server for the botnet. They 
were then able to obtain packet traces from the bots 
which attempted to connect to their C&C server. More 
recently, Kanich et al. [17] infiltrated the command and 
control infrastructure of the Storm botnet, and modified 
the spam being sent in order to measure the conversion 
rates. 


Less related to our work is the research on developing 
generic tools that can be deployed at the network layer to 
automatically detect the presence of bots [19]. Rishi [8] 
is a tool that detects the use of IRC commands and un- 
common server ports in order to identify compromised 
hosts. BotSniffer [11] and BotHunter [10] are other 
network-based anomaly detection tools that work by 
simply sniffing on the network. Our work provides a 
different perspective on bot detection: a single large 
institution, such as University of Washington, can detect 
most of the spamming bots operating at a given point in 
time by simply examining its incoming spam feed and 
correlating it with the outgoing spam of known bots. 


Spam measurement studies: Recently, a number 
of studies have examined incoming spam feeds to 
understand botnet behavior and the scam hosting in-
frastructure [1, 40, 39]. In [26], the authors use a novel
approach to collecting spam — by advertising open mail 
forwarding relays, and then collecting the spam that is 
sent through them. Botlab differs from these efforts in 
its use of both incoming and outgoing spam feeds. In 
addition to enabling application-level defenses that are 
proactive as opposed to reactive, our approach yields 
a more comprehensive view of spamming botnets that 
contradicts some assumptions and observations from 
prior work. For instance, a recent study [40] analyzes 
about 5 million messages and proposes novel clustering 
techniques to identify spam messages sent by the same 
botnet, but this is done under the assumption that 
each spam campaign is sourced by a single botnet; we 
observe the contrary to be true. Also, analysis of only 
the incoming spam feed might result in too fine-grained 
a view (at the level of short-term spam campaigns as 
in [39]) and cannot track the longitudinal behavior of 
botnets. Our work enables such analysis due to its 
use of live bots, and in that respect, we share some 
commonality with the recent study of live Storm bots 
and their spamming behavior [20]. 


8 Conclusion


In this work, we have described Botlab, a real-time bot- 
net monitoring system. Botlab’s key aspect is a multi- 
perspective design that combines a feed of incoming 
spam from the University of Washington with a feed of 
outgoing spam collected by running live bot binaries. By 
correlating these feeds, Botlab can perform a more com- 
prehensive, accurate, and timely analysis of spam bot- 
nets. 

We have used Botlab to discover and analyze today’s 
most prominent spam botnets. We found that just six 
botnets are responsible for 79% of our university’s spam. 
While domain names associated with the scams change 
frequently, the locations of C&C servers, web hosts, and 
even the content of web pages pointed to by scams re- 
main static for long periods of time. A spam botnet typ- 
ically engages in multiple spam campaigns simultane- 
ously, and the same campaign is often purveyed by mul- 
tiple botnets. We have also prototyped tools that use Bot- 
lab’s real-time information to enable safer browsing and 
better spam filtering. Overall, we feel Botlab advances 
our understanding of botnets and enables promising re-
search in anti-botnet defenses.


Abstract


A large fraction of email spam, distributed denial-of- 
service (DDoS) attacks, and click-fraud on web adver- 
tisements are caused by traffic sent from compromised 
machines that form botnets. This paper posits that by 
identifying human-generated traffic as such, one can ser- 
vice it with improved reliability or higher priority, miti- 
gating the effects of botnet attacks. 

The key challenge is to identify human-generated traf- 
fic in the absence of strong unique identities. We develop 
NAB (“Not-A-Bot”), a system to approximately identify
and certify human-generated activity. NAB uses a small 
trusted software component called an attester, which runs 
on the client machine with an untrusted OS and applica- 
tions. The attester tags each request with an attestation 
if the request is made within a small amount of time of 
legitimate keyboard or mouse activity. The remote entity 
serving the request sends the request and attestation to a 
verifier, which checks the attestation and implements an 
application-specific policy for attested requests. 

Our implementation of the attester is within the Xen 
hypervisor. By analyzing traces of keyboard and mouse 
activity from 328 users at Intel, together with adversar- 
ial traces of spam, DDoS, and click-fraud activity, we 
estimate that NAB reduces the amount of spam that cur- 
rently passes through a tuned spam filter by more than 
92%, while not flagging any legitimate email as spam. 
NAB delivers similar benefits to legitimate requests un- 
der DDoS and click-fraud attacks. 


1 Introduction 


Botnets comprising compromised machines are 
the major originators of email spam, distributed 
denial-of-service (DDoS) attacks, and click-fraud on 
advertisement-based web sites today. By one measure, 
the current top six botnets alone are responsible for more 
than 85% of all spam mail [23], amounting to more than 
120 billion messages per day that infest more than 95% 
of all inboxes [14, 24]. Botnet-generated DDoS attacks 
account for about five percent of all web traffic [9], 
occurring at a rate of more than 4000 distinct attacks 
per week on average [17]. A problem of a more recent 
vintage, click-fraud, is a growing threat to companies
that draw revenue from web ad placements [26]; bots are
said to generate 14-20% of all ad clicks today [8]. 

As a result, if it were possible to tag email or web 
requests as “human-generated,” and therefore not “bot-
generated,” the problems of spam, DDoS, and click-fraud
could be significantly mitigated. This observation is not 
new, but there is currently no good way to obtain such 
tags automatically without explicit human input. As ex-
plained in §4, requiring human input (say in the form
of answering CAPTCHAs [30]) is either untenable (per- 
suading users to answer a CAPTCHA before clicking on 
a web ad or link is unlikely to work well), or ineffective 
(e.g., because today the task of solving CAPTCHAs can 
be delegated to other machines and humans, and is not in-
extricably linked to the request it is intended to validate).

The problem with obtaining this evidence automati- 
cally is that the client machine may have been compro- 
mised, so one cannot readily trust any information pro- 
vided by software running on the compromised machine. 
To solve this problem, we observe that almost all com- 
modity PCs hitting the market today are equipped with a 
Trusted Platform Module (TPM) [28]. We use this facil- 
ity to build a trusted path between physical input devices 
(the keyboard and mouse, extensible in the future to de- 
vices like the microphone) and a human activity attester, 
which is a small piece of trusted software that runs iso- 
lated from the (untrusted) operating system. 

The key challenge for the attester is to certify human-
generated traffic without relying on strong unique iden-
tities. This paper describes NAB, a system that imple-
ments a general-purpose human activity attester (§4), and
then shows how to use this attester for email to control 
spam, and for web requests to mitigate DDoS attacks 
and click fraud. Attestations are signed statements by 
the trusted attester, and are attached to application re- 
quests such as emails. Attestations are verified by a veri- 
fier module running at the server of an entity interested in 
knowing whether the incoming request traffic was sent as 
a result of human activity. If the attestation is valid (i.e.,
it is not forged or used before), that server can take suit- 
able application-specific action—improving the “spam 
score” for an attested email message, increasing the pri- 
ority of an attested web request, etc. NAB requires minor 
modifications to client and server applications to use at-
testations, and no application protocols such as SMTP or
HTTP need be modified.

NAB’s philosophy is to do no harm to users who do 
not deploy NAB, while benefiting users who do. For ex- 
ample, email senders who use the attester decrease the 
likelihood that their emails are flagged as spam (that is, 
decrease the false positives of spam detectors), and email 
receivers that use the verifier see reduced spam in their 
inboxes. These improvements are preserved even under 
adversarial workloads. Further, since NAB does not use 
identity-based blacklisting or filtering, legitimate email 
from an infected machine can still be delivered with valid 
attestations. 

The NAB approach can run on any platform that pro- 
vides for the attested execution of trusted code, either 
directly or via a secure booting mechanism such as those 
supported by Intel’s TXT and AMD’s Pacifica architec-
tures. We have constructed our prototype attester as a host
kernel module running under a trusted hypervisor. Other
implementations, such as building the attester within the 
trusted hardware, or running it in software without virtu- 
alization (e.g., via Flicker [16]) are also possible. 

Our prototype extends the Xen hypervisor [3], thus 
isolating itself from malicious code running within un- 
trusted guest operating systems in a virtual machine. We 
stripped the host kernel and Xen Virtual Machine Mon- 
itor (VMM) down to fewer than 30,000 source lines, in- 
cluding the necessary device drivers, and built the attester 
as a 500-line kernel module. This code, together with 
the TPM and input devices, forms the trusted computing
base (TCB). Generating an attestation on a standard PC
takes fewer than $10^7$ CPU cycles, or less than 10 ms on
a 2 GHz processor, making NAB practical for handling 
fine-grained attestation requests, such as individual web 
clicks or email messages. 
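
As a quick consistency check on the latency figure, assuming the cycle bound is read as $10^7$ and the clock rate is 2 GHz:

```latex
\[
  t \;<\; \frac{10^{7}\ \text{cycles}}{2 \times 10^{9}\ \text{cycles/s}}
    \;=\; 5 \times 10^{-3}\ \text{s}
    \;=\; 5\ \text{ms} \;<\; 10\ \text{ms}.
\]
```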

We evaluate whether NAB can be applied to spam con- 
trol, DDoS defense, and click-fraud detection, using a 
combination of datasets containing normal user activity 
and malicious bot activity. We used traces of keyboard 
and mouse activity from 328 PCs of volunteering users 
at Intel gathered over a one-month period in 2007 [11], 
packet-level traces of bot activity that we gathered from a 
small number of “honeypot” computers infected by mal- 
ware at the same site, as well as publicly available traces 
of email spam and DDoS activity. On top of those traces, 
we constructed an adversarial workload that maximizes 
the attacker’s benefit obtained under the constraints im- 
posed by NAB. Our experimental study shows that: 


1. With regards to spam mitigation, we reduced the vol- 
ume of spam messages that evaded a traditional spam 
filter (what are called false negatives for the spam 
filter) by 92%. We reduced the volume of legiti- 
mate, non-spam messages that were misclassified by 
the spam filter (false positives) to 0. 

2. With regards to web DDoS mitigation, we depriori-
tized 89% of bot-originated web activity without im-
pacting human-generated web requests.


3. With regards to click-fraud mitigation, we detected
bot-originating click-fraud activity with higher than
87% accuracy, without losing any human-generated
web clicks.


Although our specific results correspond only to our par- 
ticular traces, choice of applications, and threat model 
(e.g., NAB does nothing to mitigate the volume of evil 
traffic created manually by an evil human), we argue that 
they apply to a large class of on-line applications affected 
by bot traffic today. Those include games, brokerage, and 
single sign-on services. This suggests that a human ac- 
tivity attestation module might be a worthwhile addition 
to the TCB of commodity systems for the long term. 


2 Threat Model and Goal 


Threat model and assumptions. We assume that the OS 
and applications of a host cannot be trusted, and are sus- 
ceptible to compromise. A host is equipped with a TPM, 
which boots the attester stack—this includes all software 
on which the attester implementation depends, such as 
the host kernel and VMM in our implementation (§4.4).
This trust in the correct boot-up of the attester can be re- 
motely verified, which is the standard practice for TPM- 
assisted secure booting today. We assume that the users 
of subverted hosts may be lax, but not malicious enough 
to mount hardware attacks against their own machine’s 
hardware (such as shaving the protective coating off their 
TPM chip or building custom input hardware). We as- 
sume correct hardware, including the correct operation 
and protection of the TPM chip from software attacks, as 
per its specification [28]. We make no assumptions about 
what spammers do with their own hardware. Finally, we 
assume that the cryptographic primitives we use are se- 
cure, and that their implementations are correct. 

Goal. NAB consists of an attester and a verifier. Our 
primary goal is to distinguish between bot and human- 
generated traffic at the verifier, so that the verifier can 
implement application-specific remedies, such as prior- 
itizing or improving the delivery of human traffic over 
botnet traffic. We would like to do so without requiring 
any user input or imposing any cognitive burden on the 
user. 

We aim to bound the final botnet traffic that man- 
ages to bypass any measures put up against it (spam 
and DDoS filters, click fraud detectors, etc.). We will 
consider our approach successful if we can reduce this 
botnet traffic that evades our best approaches today to a 
small fraction of its current levels (~10%), even in the
worst case for NAB (i.e., with adaptive bots that modu-
late their behavior to gain the maximum benefit allow-
able by our mechanism), while still identifying all valid
human-generated traffic correctly. We set this goal be- 
cause we do not believe that purely technical approaches 
such as NAB will completely suppress attack traffic such 
as spam, since spam also relies on social engineering. 
We demonstrate that NAB achieves this goal with our re-
alistic workloads and adaptive bots (§6).


3 NAB Architecture 


We now present the requirements and constraints that 
drive the NAB architecture. 


3.1 Requirements and Constraints 


Requirements. There are four main requirements. First, 
attestations must be generated in response to human re- 
quests automatically. Second, such attestations must not 
be transferable from the client on which they are gen- 
erated to attest traffic originating from another client. 
Third, NAB must benefit users that deploy it without 
hurting those that do not. Fourth, NAB must preserve 
the existing privacy and anonymity semantics of applica- 
tions while delivering these benefits. 

Constraints. NAB has two main constraints. First, the 
host’s OS or applications cannot be trusted. In particu- 
lar, a compromised machine can actively try to subvert 
the attester functionality. Second, the size of the attester 
TCB should be small, because it is a trusted component; 
the smaller a component is, the easier it is to validate it 
operates correctly, which makes it easier to trust.

Challenge. The key challenge is to meet these require-
ments without assuming the existence of globally unique
identities. Even assuming a public-key infrastructure 
(PKI), deploying and managing large-scale identity sys- 
tems that map certificates to users is a daunting prob- 
lem [4]. 

Without such identities, the requirements are hard to 
meet, and, in some cases, even seemingly in conflict 
with each other. For example, generating attestations 
automatically without trusting the OS and applications 
is challenging. Further, there is tension between the re- 
quirement that NAB should benefit its users without hurt- 
ing other users, and the requirement that NAB should 
preserve the existing anonymity and privacy semantics. 
NAB’s attestations are anonymously signed certificates 
of requests, and the membership size of the signing keys 
is several million. We describe how NAB uses such attes-
tations to overcome the absence of globally unique iden-
tities in §4.4.

TPM background. The TPM is a small chip specified
by the Trusted Computing Group to strengthen the secu-
rity of computer systems in general. A TPM provides
many security services, among which is the ability to mea-
sure and attest to the integrity of trusted software run-
ning on the computer at boot time. Since a TPM is too
slow to be used routinely for cryptographic operations
such as signing human activity, we use the TPM only for
its secure bootstrap facilities, to load an attester, a small
trusted software module that runs on the host processor
and generates attestations (i.e., messages asserting hu-
man activity).

Figure 1: NAB architecture. The thick black line en-
closes the TCB.

The attester relies on two key primitives provided by 
TPMs. The first is called direct anonymous attesta- 
tion (DAA), which allows the attester to sign messages 
anonymously. Each TPM has an attestation identity key 
(AIK), which is an anonymous key used to derive the at- 
tester’s signing key. The second primitive is called sealed 
storage, which provides a secure location to store the 
attester’s signing key until the attester is measured and 
launched correctly. 


3.2 Architecture 


NAB consists of an attester that runs locally at a host 
and generates attestations, as well as an external verifier 
that validates these attestations (running at a server ex- 
pected to handle spam and DDoS requests, or checking 
for click fraud). The attester code hashes to a well-known 
SHA-1 value, which the TPM measures at launch. The 
attester then listens on the keyboard and mouse ports for 
human activity clicks, and decides whether an attestation 
should be granted to an application when the application 
requests one. If the attester decides to grant an attes- 
tation, the application can submit the attestation along 
with the application request to the verifier for human ac- 
tivity validation. The verifier can confirm human activity 
as long as it trusts the attestation TCB, which consists 
of the attester, the TPM, and input device hardware and 
drivers. This architecture is shown in Figure 1.

Attestations are signed messages with two key proper-
ties that enable the verifier to validate them correctly:


1. Non-transferability. An attestation generated on a 
machine is authenticated by a chain of signing keys 
that pass through that machine’s TPM. Hence, a valid
attestation cannot be forged to appear as if it were is-
sued by an attester other than its creator, and no valid
attestation can be generated without the involvement 
of a valid attester and TPM chip. 


2. Binding to the content of a request. An attestation
contains the hash digest of the content of the request
it is attesting to. Since an attester generates an attes- 
tation only in response to human activity, this binding 
ensures that the attestation corresponds to the content 
used to generate it. Binding thus allows a request to be 
tied as closely as practical to the user’s intent to gen- 
erate that request, greatly reducing opportunities for 
using human activity to justify unrelated requests. 
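
To make the content-binding property concrete, the following Python sketch shows an attestation that carries a digest of the request and a verifier that rejects forged, mismatched, or reused attestations. It rests on several assumptions that are not part of NAB: an HMAC under a shared key stands in for the attester's anonymous, TPM-rooted signature (so non-transferability is not modeled), SHA-256 is an arbitrary choice of hash, and the `nonce` field is a hypothetical way to make each attestation unique.

```python
import hashlib
import hmac
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Attestation:
    content_digest: bytes  # hash of the request being attested
    nonce: bytes           # assumed field that makes each attestation unique
    tag: bytes             # stand-in for the attester's signature


def attest(key: bytes, content: bytes) -> Attestation:
    """Bind an attestation to `content` by tagging the digest of that content."""
    digest = hashlib.sha256(content).digest()
    nonce = os.urandom(16)
    tag = hmac.new(key, digest + nonce, hashlib.sha256).digest()
    return Attestation(digest, nonce, tag)


class Verifier:
    def __init__(self, key: bytes) -> None:
        self._key = key
        self._seen = set()  # nonces of attestations already accepted

    def verify(self, att: Attestation, content: bytes) -> bool:
        expected = hmac.new(self._key, att.content_digest + att.nonce,
                            hashlib.sha256).digest()
        if not hmac.compare_digest(expected, att.tag):
            return False  # forged or corrupted
        if hashlib.sha256(content).digest() != att.content_digest:
            return False  # attestation is bound to different content
        if att.nonce in self._seen:
            return False  # already used once
        self._seen.add(att.nonce)
        return True
```

Because the digest is covered by the tag, an attestation obtained for one request cannot be attached to a different request without failing the second check.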


4 Attester Design and Implementation 


Our attester design assumes no special hardware support 
other than the availability of a TPM device. However, it 
is flexible enough to exploit the recent processor exten- 
sions for trusted computing such as AMD’s Secure Vir- 
tual Machine (SVM) or Intel’s Trusted Execution Tech- 
nology (TXT) to provide additional features such as late 
launch (i.e., non boot-time launch), integration into the 
TCB of an OS, etc., in the future. 

The attester’s sole function is to generate an attestation 
when an application requests one. An attestation request 
contains only the application-specific content to attest to 
(e.g., the email message to send out). The attester may 
provide the attestation or refuse to provide an attestation 
at all. We discuss two important decisions: when to grant 
an attestation and what to attest. 


4.1 When To Grant An Attestation 


The key question in designing the attester is deciding un- 
der what conditions a valid attestation must be granted. 
The goal is to simultaneously ensure that human- 
generated traffic is attested, while all bot-generated traf- 
fic is denied attestation. 

The attester’s decision is one of guessing the human’s 
presence and intent: was there a human operating the 
computer, and did she really intend to send the particular 
email for which the application is requesting an attesta- 
tion? Since the attester lacks a direct link to the human’s 
intentions, it must guess based on the trusted inputs avail- 
able: the keyboard and mouse. We considered three key 
design points for such a guessing module. 

The best-quality guess is not a guess at all: the attester 
could momentarily take over the keyboard, mouse, and 
display device, and prompt the user with a specific ques- 
tion to attest or not attest to a particular email. Since the 
OS and other applications are displaced in the process, 
only the human user can answer the question. From the 
interaction point of view, this approach is similar to the
User Account Control (UAC) tool in Microsoft Windows
Vista, in which the OS prompts the user for explicit ap- 
proval before performing certain operations, although in 
our context it would be the much smaller and simpler 
attester that performs that function. While technically 
feasible to implement, users have traditionally found ex- 
plicit prompts annoying in practice, as revealed by the 
negative feedback on UAC [29]. What is worse, user 
fatigue inevitably leads to an always-click-OK user be- 
havior [32], which defeats the purpose of attestation. 

So, we only consider guesses made automatically. In
particular, we use implicit guessing of human intent, using
timing as a heuristic: how recently before a particular
attestation request was the last keyboard or mouse activity
observed? We call this a “$t-\delta$” attester, where
$\delta_m$ denotes the time since the last mouse activity
and $\delta_k$ denotes the time since the last keyboard activity.
For example, the email application requests an attestation
specifying that a keyboard or mouse click should have
occurred within the last $\Delta_k$ or $\Delta_m$ milliseconds,
respectively, where $\Delta_{\{k,m\}}$ is the application-specified
upper bound. The attester generates attestations that indicate
this time lag, or refuses if the lag is longer than
$\Delta_{\{k,m\}}$ milliseconds.
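
The timing rule above can be sketched in a few lines. This is only an illustrative sketch, not the attester's actual code: the names `InputClock` and `check_request` are hypothetical, and we assume that a request is satisfied when either the keyboard lag or the mouse lag falls within its application-specified bound.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class InputClock:
    """Remembers when trusted keyboard and mouse events were last seen (ms)."""
    last_key_ms: Optional[float] = None
    last_mouse_ms: Optional[float] = None

    def deltas(self, now_ms: float):
        """Return (delta_k, delta_m); None means no activity has been seen."""
        dk = None if self.last_key_ms is None else now_ms - self.last_key_ms
        dm = None if self.last_mouse_ms is None else now_ms - self.last_mouse_ms
        return dk, dm


def check_request(clock, now_ms, bound_k_ms, bound_m_ms):
    """Return the observed (delta_k, delta_m) if either lag is within its
    bound (assumed reading of the rule), otherwise None (attester refuses)."""
    dk, dm = clock.deltas(now_ms)
    key_ok = dk is not None and dk <= bound_k_ms
    mouse_ok = dm is not None and dm <= bound_m_ms
    return (dk, dm) if (key_ok or mouse_ok) else None
```

For instance, with a key press 500 ms ago and a mouse click 1,300 ms ago, a request with both bounds set to one second is granted on the strength of the key press alone.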

This method is simpler and cheaper in terms of re- 
quired resources than an alternative we carefully consid- 
ered and eventually discarded. Using keyboard activity 
traces, we found that good-quality guesses can be ex- 
tracted by trying to support the content of an attestation 
request using specific recent keyboard and mouse activ- 
ity. For example, the attester can observe and remember 
a short sequential history of keystrokes and mouse clicks 
in order of observation. When a particular attestation re- 
quest comes in, the attester searches for the longest sub- 
sequence of keyclicks that matches the content to attest. 
An attestation could be issued containing the quality of 
match (e.g., a percentage of content matched), only rais- 
ing an explicit alarm and potential user prompting if that 
match is lower than a configurable threshold (say 60%). 
This design point would not attest to bot requests unless
they contained significant content overlap with legitimate
user traffic. Nevertheless, this method proved complex to
implement, given the typical multitasking behavior of modern
users (switching between windows, interleaving keyboard and
mouse activity, inserting, deleting, selecting and overwriting
text, etc.). So, we ultimately discarded it in favor of the
simpler $t-\delta$ attester, which allowed a simple
implementation with a small TCB size.
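
To make the discarded alternative concrete, the following sketch computes a match quality as the length of the longest common subsequence between a keystroke history and the content to attest, divided by the content length. The function names and the choice of a character-level subsequence are our assumptions for illustration; the 60% threshold is the example value mentioned above.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b (row-wise DP)."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def match_quality(keystrokes: str, content: str) -> float:
    """Fraction of the content covered by the observed keystroke history."""
    if not content:
        return 0.0
    return lcs_length(keystrokes, content) / len(content)


def needs_user_prompt(keystrokes: str, content: str, threshold: float = 0.60) -> bool:
    """True when the match is too weak and an explicit alarm would be raised."""
    return match_quality(keystrokes, content) < threshold
```

Even this toy version ignores window switches, deletions and pasted text, which hints at why a faithful implementation would grow the TCB.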

One drawback of the $t-\delta$ attester is that it allows a bot
to generate attestations for its own traffic by “harvesting”
existing user activity. So, NAB could allow illegitimate
traffic to receive attestations, though only at the rate of
human activity.

NAB mitigates this situation in two ways. First, NAB
ensures that two attestations are separated by at least the
application-specific $\Delta$ milliseconds. For email, we find
from the traces (§6) that $\Delta$ = 1 second works well. Since
key clicks cannot be captured or stored, we throttle a bot
significantly in practice. Today’s bots send several tens
of thousands of spam messages within a few hours [14], so even
an adaptive bot is constrained by this rate limit.
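
The effect of the minimum separation can be illustrated with a small throttle. This is a sketch under our own naming (`AttestationThrottle` is hypothetical); it models only the spacing rule, not the requirement that human activity be present.

```python
class AttestationThrottle:
    """Allows at most one attestation per min_gap_ms milliseconds."""

    def __init__(self, min_gap_ms: float):
        self.min_gap_ms = min_gap_ms
        self._last_issued_ms = None  # time of the most recent attestation

    def try_issue(self, now_ms: float) -> bool:
        """Return True and record the time if enough time has elapsed."""
        if (self._last_issued_ms is not None
                and now_ms - self._last_issued_ms < self.min_gap_ms):
            return False
        self._last_issued_ms = now_ms
        return True
```

With a gap of one second, a bot that asks every 100 ms for a full hour obtains at most 3,600 attestations, and in NAB only during periods of actual human input.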

Second, if legitimate traffic fails to receive an attesta- 
tion (e.g., because bot code attestation requests absorbed 
all recent user activity before the user’s application had 
a chance to do so), a NAB-aware application alerts the 
user that it has not been able to acquire an attestation, 
possibly alerting the user that unwholesomeness is afoot 
at her computer. We note that this technique is not per- 
fect, because a bot can hijack such prompts. In practice, 
we found that such feedback is useful, although we eval- 
uate NAB assuming adversarial bots. 


4.2 What To Attest 


The second attester design decision is what to attest, i.e.,
how much to link a particular attestation to the issuer, the 
verifier, and the content. 

Traditional human activity solutions such as 
CAPTCHAs [30] do not link to the actual request 
being satisfied. A CAPTCHA is a challenge that only a 
human is supposed to be able to respond to. A correct 
response to a CAPTCHA attests to the fact that a human 
was likely involved in answering the question, but it 
does not say where the human was or whether the 
answer came from the user of the service making the 
request. The problem is that human activity can be 
trafficked, as evidenced by spammers who route human 
activity challenges meant for account creation to sketchy 
web sites to have them solved by those sites’ visitors 
in exchange for free content [25], or to sweatshops 
with dedicated CAPTCHA solvers. Thus, a human was 
involved in providing the activity, but not necessarily the 
human intended by the issuer of the challenge. 

In contrast, NAB generates responder-specific, 
content-specific, and, where appropriate, challenger- 
specific attestations. Attestations are certificates of 
human activity that contain a signature over the entire 
request content. For example, an email attestation 
contains the signature over the entire email, including 
the “From:” address (i.e., the responder), the email 
body (i.e., the content), and the “To:” address (i.e., the 
challenger). Similarly, a web request attestation contains 
the URL, which provides both responder-specific and 
content-specific attestations. 

Content-specific attestation is more subtle. Whereas
CAPTCHAs are used today for coarse-grained actions
such as email account creation, they are considered too
intrusive to be used for finer granularity requests such as
sending email or retrieving web URLs. So, in practice,
the challenge response is “amortized” over multiple requests
(i.e., all email sent from the CAPTCHA-created
mail account). Even if an actual human created the account,
nothing prevents the bots in that human’s desktop
from sending email indiscriminately using that account.

Figure 2: Attester interfaces.

Finally, challenger-specific attestation helps in ensur- 
ing that unwitting, honest humans do not furnish attes- 
tations for bad purposes. A verifier expecting an attes- 
tation from human A’s attester will reject an attestation 
from human B that might be provided instead. In the 
spam example, this is tantamount to explicit sender au- 
thentication. 

Attestations with these three properties, together with
application-specific verifier policies described in §5.2,
meet our second and third requirements (§3.1).


4.3 Attester API 


Figure 2 shows the relationship between the attester and 
other entities. The API is simple: there is only a sin- 
gle request/reply pair of calls between the OS and the 
attester. An application’s attestation request contains the 
hash of the message to be attested (i.e., the contents of an
email message or the URL of a browser click), the type 
of attestation requested, and the process id (PID) of the 
requesting process. 

If the attester verifies that the type of attestation being
requested is consistent with user activity seen on the
keyboard/mouse channels, it signs the attestation and,
depending on the attestation type, includes $\delta_m$ and
$\delta_k$, which indicate how long ago a mouse click and a
keyboard click respectively were last seen. The attestation
is an offline computation, and is thus an instance of a
non-interactive proof of human activity.

The same API is used for all applications. The only
customization allowed is whether to include the values
of $\delta_m$ or $\delta_k$, depending on the attestation type. The
attester uses a group signature scheme for anonymous
attestations, extending the Direct Anonymous Attestation
(DAA) service [7] provided by recent TPMs. Anonymous
attestations preserve the current privacy semantics
of web and email, thereby meeting our fourth and final
requirement (§3.1).

We have currently defined and implemented two attestation
types. Type 0 is for interactive applications such
as all types of web requests. Type 1 is for delay-tolerant
applications such as email. Type 0 attestations are generated
only when there is either a mouse or keyboard click
in the last one second, and do not include the $\delta_m$ or
$\delta_k$ values. Type 0 attestations are offered as a privacy
enhancement, to prevent verifiers from tracking at a fine
temporal granularity a human user’s activity or a particular
request’s specific source machine. We chose one
second as the lag for Type 0 attestations since it is sufficient
for local interactive applications; for example, this
is ample time between a key or mouse click and a local
action such as generating email or transmitting an HTTP
GET request. Type 1 attestations can be used with all
applications we have examined, when this finer privacy
concern is unwarranted. To put the two types in perspective,
a Type 0 attestation is roughly equivalent to a Type
1 attestation requested with $\Delta_m = \Delta_k = 1$ sec and in
which the attested $\delta_m$/$\delta_k$ values have been hidden.
Attestation structure. An attestation has the form
$(d, n, \delta_m, \delta_k, \sigma, C)$. It contains a cryptographic
content digest $d$ (e.g., a SHA-1 hash) of the application-specific
payload being attested to; a nonce $n$ used to
maintain the freshness of the attestations and to disallow
improper attestation reuse; the $\delta_{\{m,k\}}$ values
(for type 1 attestations); the attestation signature
$\sigma = \mathrm{sign}(K_{priv}, (d, n, \delta_m, \delta_k))$; and a
certificate $C$ from the
TPM guaranteeing the attester’s integrity, the version of
the attester being used, the attestation identity key of the
TPM that measured the attester integrity, and the signed
attester’s public key $K_{pub}$ (Figure 2). The certificate $C$
is generated during booting of the attester and is stored
and reused until reboot.
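
The tuple can be pictured as a small record. The sketch below is illustrative only: the attester described here signs with a private key under a group signature scheme, whereas this sketch substitutes an HMAC from the Python standard library as a stand-in so that it stays self-contained. The field names, the byte encoding of the signed tuple, and the helper names are all our assumptions.

```python
import hashlib
import hmac
import os
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Attestation:
    d: bytes                 # content digest of the payload
    n: bytes                 # nonce, for freshness and reuse detection
    delta_m: Optional[int]   # ms since last mouse click (None when hidden)
    delta_k: Optional[int]   # ms since last key press (None when hidden)
    sigma: bytes             # signature over (d, n, delta_m, delta_k)
    C: bytes                 # opaque certificate blob


def _signed_bytes(d, n, delta_m, delta_k) -> bytes:
    """Hypothetical unambiguous encoding of the signed tuple."""
    parts = [d.hex(), n.hex(), str(delta_m), str(delta_k)]
    return "|".join(parts).encode()


def make_attestation(key, payload, cert, delta_m=None, delta_k=None, nonce=None):
    d = hashlib.sha1(payload).digest()
    n = nonce if nonce is not None else os.urandom(16)
    # Stand-in for sign(K_priv, ...): a keyed MAC, NOT a public-key signature.
    sigma = hmac.new(key, _signed_bytes(d, n, delta_m, delta_k), hashlib.sha256).digest()
    return Attestation(d, n, delta_m, delta_k, sigma, cert)


def signature_ok(key, att, payload) -> bool:
    """Check that the digest matches the payload and the MAC matches the tuple."""
    if hashlib.sha1(payload).digest() != att.d:
        return False
    expected = hmac.new(
        key, _signed_bytes(att.d, att.n, att.delta_m, att.delta_k), hashlib.sha256
    ).digest()
    return hmac.compare_digest(expected, att.sigma)
```

Because the lag values are covered by the signature, altering them after the fact invalidates the attestation.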

The mechanism for attesting to web requests is simple: 
when a user clicks on a URL that is either a normal link 
or an ad, the browser requests an attestation on the entire 
page URL. After the browser fetches the page content, it 
uses the same attestation to retrieve any included objects 
within the page. As explained in §5.2, the verifier accepts
the attestation for all included objects. 

The mechanism for sending email in the common case 
is also straightforward: the entire email message, includ- 
ing headers and attachments, constitutes the request. In- 
terestingly, the same basic mechanism is extensible to 
other email usage scenarios, such as text or web-based 
email, email-over-ssh, batched and offline email, and 
script-generated email. 

Email usage scenarios (mailing lists; remote, batched,
offline, scripted or web mail). To send email to mailing
lists, the attester attests to the email normally, except
that the email destination address is the name of the target
mailing list. Every recipient’s verifier then checks
that the recipient is subscribed to the mailing list, as
described in §5.2. Also, a text-based email application running
remotely over ssh can obtain attestations from the
local machine with the help of the ssh client program 
executing locally. This procedure is similar to authenti- 
cation credential forwarding implemented in ssh. Simi- 
larly, a graphical email client can obtain and store an at- 
testation as soon as the “send” button is clicked, regard- 
less of whether it has a working network connection, or 
if the email client is in an offline mode, or if the client 
uses an outbox to batch email instead of sending it im- 
mediately. In case of web mail, a browser can obtain an 
attestation on behalf of the web application. 

Script-generated email is more complex. The PID 
argument in the attestation request (Figure 2) is used 
for deferred attestations, which are attestations approved 
ahead of time by the user. Such forms of attestation 
are not required normally, and are useful primarily for 
applications such as email-generating scripts, cron-jobs, 
etc. When an application requests a deferred attestation, 
the user approves the attestation explicitly through a reserved
click sequence (currently “Ctl-Alt-F4”, followed
by the number of deferred attestations). These attestations
are stored in a simple PID-table in the attester, and re- 
leased to the application in the future. Since the content 
of a deferred attestation is not typically known until later 
(such as when the body of an email is dynamically gen- 
erated), it is dangerous to release an unbound attestation 
to the untrusted OS. Instead, the attester stores the de- 
ferred attestations in its own memory, and releases only 
bound attestations. Although the attester ensures that un- 
bound attestations are not released to the untrusted OS, 
thereby limiting damage, there is no way to ensure that 
these attestations are not stolen by a bot faking the legit- 
imate script’s PID. However, the user is able to reliably 
learn about the missing attestations after this occurrence, 
which is helpful during troubleshooting. 
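
The PID-table bookkeeping for deferred attestations might look as follows. This is a sketch with hypothetical names (`DeferredAttestations`, `approve`, `release_bound`); the `sign` callable stands in for the attester's signing step, and the approval is assumed to have already been confirmed through the reserved click sequence.

```python
class DeferredAttestations:
    """Per-PID count of user-approved attestations not yet released."""

    def __init__(self):
        self._credits = {}  # PID -> number of approved, unused attestations

    def approve(self, pid: int, count: int) -> None:
        """Record an approval; call only after the reserved click sequence."""
        if count <= 0:
            raise ValueError("count must be positive")
        self._credits[pid] = self._credits.get(pid, 0) + count

    def remaining(self, pid: int) -> int:
        return self._credits.get(pid, 0)

    def release_bound(self, pid: int, content_digest: bytes, sign):
        """Consume one credit and return an attestation bound to the given
        content, or None if the PID has no credits left. Unbound
        attestations never leave the table."""
        if self.remaining(pid) == 0:
            return None
        self._credits[pid] -= 1
        return sign(content_digest)
```

A script that finds its credits already consumed learns that something else used its PID, which matches the troubleshooting signal described above.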


4.4 Attester Implementation 


The attester is a small module, currently at fewer than 
500 source code lines. It requires a TPM chip conform- 
ing to any revision of the TPM v1.2 specification [28]. 
Attester installation. The attester is installed by binding
its hash value to an internal TPM register called a
Platform Configuration Register (PCR). We use $PCR_{18}$.
Initially, the register value is -1. We extend it with the
attester through the TPM operation:


PCRExtend(18, H(ATT)) 
where H(ATT) is the attester’s hash. If the attester
needs to be updated for some reason (which should be
a rare event), $PCR_{18}$ is reinitialized and extended with
the new code value.
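
For readers unfamiliar with the extend operation, the sketch below models it in software. In TPM v1.2 a PCR holds a 20-byte SHA-1 value, and extending replaces it with the SHA-1 hash of the old value concatenated with the new measurement. We interpret the initial value of -1 as all bits set; the attester image bytes are a placeholder.

```python
import hashlib

PCR_SIZE = 20  # bytes; a TPM v1.2 PCR holds one SHA-1 digest


def pcr_extend(pcr_value: bytes, measurement: bytes) -> bytes:
    """Software model of extend: new PCR = SHA-1(old PCR || measurement)."""
    if len(pcr_value) != PCR_SIZE or len(measurement) != PCR_SIZE:
        raise ValueError("PCR value and measurement must be 20 bytes each")
    return hashlib.sha1(pcr_value + measurement).digest()


initial = b"\xff" * PCR_SIZE                        # "-1", read as all bits set
attester_image = b"attester code bytes (placeholder)"
h_att = hashlib.sha1(attester_image).digest()       # H(ATT)
pcr18 = pcr_extend(initial, h_att)                  # models PCRExtend(18, H(ATT))
```

Since the old value is hashed into the new one, a register can only be driven to a given value by replaying the same sequence of measurements from the same starting point.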

Key generation. At install time, the attester generates
an anonymous signing key pair: $\{K_{pub}, K_{priv}\}$. This
key pair is derived from the attestation identity key $AIK$
of the TPM, and is an offline operation. $K_{priv}$ allows the
attester to sign requests anonymously. The attester then
seals the private key $K_{priv}$ to the TPM using the TPM’s
private storage root key $K_{root}$.

Assume that the system BIOS, which boots before the
attester, extends $PCR_{17}$. Thus, the sealing operation
renders $K_{priv}$ inaccessible to everyone but the attester
by executing the TPM call:


$\mathrm{Seal}((17, 18), K_{priv})$


which returns the encrypted value $C$ of $K_{priv}$. The TPM
unseals and releases the key only to the attester, after the
attester is booted correctly.

Until the TPM releases $K_{priv}$ and the accompanying
certificate to the attester, there is thus no way for the host
to prove to an external verifier that a request is accompanied
by human activity. Conversely, if the attester has a
valid private key, the external verifier is assured that the
attester is not tampered with.
Attester booting. The attester uses a static chain of trust
rooted at the TPM and established at boot-time. It is
booted as part of the secure boot loading operation before
the untrusted OS itself is booted. After the BIOS
is booted, it measures and launches the attester. After
the attester is launched, it unseals the previously sealed
$K_{priv}$ by executing:


$\mathrm{Unseal}(C, \mathrm{MAC}_{K_{root}}((17, PCR_{17}), (18, PCR_{18})))$


The Unseal operation releases $K_{priv}$ only if the PCR
registers 17 and 18 after reboot contain the same hash
values as the registers at the time of sealing $K_{priv}$. If
the PCR values match, the TPM decrypts $C$ and returns
$K_{priv}$ to the attester.
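
The release condition can be illustrated with a toy model. This is not a TPM interface: `ToyTPM`, `SealedBlob` and `pcr_composite` are hypothetical names, and the secret is stored in the clear, whereas a real TPM encrypts the sealed blob under its storage root key. The sketch shows only that unsealing succeeds exactly when the selected PCRs hold the values they held at seal time.

```python
import hashlib
from dataclasses import dataclass


def pcr_composite(pcrs: dict, indices) -> bytes:
    """Digest over the selected PCR indices and their current values."""
    h = hashlib.sha1()
    for i in sorted(indices):
        h.update(i.to_bytes(1, "big") + pcrs[i])
    return h.digest()


@dataclass(frozen=True)
class SealedBlob:
    indices: tuple    # PCR indices the secret is bound to, e.g. (17, 18)
    expected: bytes   # composite digest recorded at seal time
    secret: bytes     # toy model only: a real TPM would encrypt this


class ToyTPM:
    def __init__(self, pcrs: dict):
        self.pcrs = dict(pcrs)

    def seal(self, indices, secret: bytes) -> SealedBlob:
        return SealedBlob(tuple(indices), pcr_composite(self.pcrs, indices), secret)

    def unseal(self, blob: SealedBlob) -> bytes:
        if pcr_composite(self.pcrs, blob.indices) != blob.expected:
            raise PermissionError("PCR values differ from those at seal time")
        return blob.secret
```

A platform that boots a modified attester ends up with a different value in register 18, so the same blob can no longer be opened.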

Thus, by sealing the anonymous signing key $K_{priv}$ to
the TPM and using secure boot loading to release the key
to the attester, NAB meets the challenge of generating
attestations without globally unique identities.
Attester execution. The attester waits passively for at- 
testation requests from an application routed through the 
untrusted OS. A small untrusted stub is loaded into the 
OS in order to interact with the attester on behalf of the 
application. 

With our current attester design and implementation,
applications need to be modified in order to obtain attestations.
We find the modifications to be fairly small and
localized (§6). The only change as far as applications
are concerned is to first obtain appropriate attestations 
and then include them as part of the requests they submit 
today. Protocols such as SMTP (mail) or HTTP (web) 
need not be modified in order to include this function- 
ality. SMTP allows extensible message headers, while 
HTTP can include the attestation as part of the “user 
agent” browser string or as an extended header. 


5 Verifier Design and Implementation 


We now describe how verifiers use attestations to imple- 
ment attack-specific countermeasures for spam, DDoS 
and click-fraud. 


5.1 Verifier Design 


The verifier is co-located with the server processing re- 
quests. We describe how the server invokes the verifier 
for each application in §5.2. When invoked, the verifier
is passed both the attestation and the request. The attes- 
tation and request contain all the necessary information 
to validate the request. 

The verifier first checks the validity of the attester public
key used for signing the request, by traversing the
public-key chain in the certificate $C$ (Figure 2). If valid,
it then recomputes the hash of the request’s content and
verifies whether the signed hash value in the attestation
matches the request’s contents. Further, for attestations
that include the $\delta_{\{m,k\}}$ values, the verifier also checks
whether $\delta_{\{m,k\}}$ are less than the application-specified
$\Delta_{\{m,k\}}$. The verifier then checks to ensure that the
attestation is not being double-spent, as described in §5.3.
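
The sequence of checks can be sketched as a single function. Everything here is illustrative: the dictionary keys mirror the attestation tuple, `cert_chain_ok` and `signature_ok` are caller-supplied placeholders for the cryptographic steps, and the text does not say whether one or both lags must be within bounds, so this sketch assumes that one timely input suffices.

```python
import hashlib


def verify_attestation(att, request_body, bounds_ms, seen_nonces,
                       cert_chain_ok, signature_ok):
    """Run the checks in order; return "accept" or a rejection reason."""
    if not cert_chain_ok(att["C"]):
        return "reject: certificate"
    if not signature_ok(att):
        return "reject: signature"
    if hashlib.sha1(request_body).digest() != att["d"]:
        return "reject: content mismatch"
    # Lag check applies only to attestations that carry delta values.
    reported = {k: att.get(k) for k in ("delta_m", "delta_k")
                if att.get(k) is not None}
    if reported and not any(v <= bounds_ms[k] for k, v in reported.items()):
        return "reject: stale input"
    if att["n"] in seen_nonces:
        return "reject: double-spent"
    seen_nonces.add(att["n"])
    return "accept"
```

The nonce is recorded only after every other check passes, so a rejected request does not burn a nonce that a legitimate request might still present.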

A bot running in an untrusted domain cannot masquerade
as a trusted attester to the verifier because a TPM
will not release the signed $K_{pub}$ (Figure 2) to the bot
without the correct code hash. Further, it derives no benefit
from tampering with the $\delta$ values it specifies in its
requests, because the verifier enforces the application-specified
upper limit on $\delta_{\{m,k\}}$.

The verifier then implements an application-specific 
policy as described next. 


5.2 Application-specific Policies 


Verifiers implement application-specific policies to deal 
with bot traffic. Spam can be more aggressively filtered 
using information in the attestations, legitimate email 
with attestations can be correctly classified, DDoS can 
be handled more effectively by prioritizing requests with 
attestations over traffic without attestations, and click- 
fraud can be reduced by only serving requests with valid 
attestations and ignoring other requests. 

Figure 3: Sender ISP’s verifier algorithm.


We now describe how the verifier implements such 
application-specific policies. Note that these are only 
example policies that we constructed for our three case 
studies, and many others are possible. 


5.2.1 Spam policy 


The biggest problem with Bayesian spam filters such as 
spamassassin today is that they either flag too much le- 
gitimate email as spam, or flag too little spam as such. 

When all legitimate requests are expected to carry attestations,
the verifier can set spam filters aggressively to
flag questionable unattested messages as spam, but use
positive evidence of human activity to “whitelist” questionable
attested messages.
Sender ISP’s email server. The verifier sits on the 
sender ISP’s server alongside a Bayesian spam filter like 
spamassassin. The filter is configured at an aggressive, 
low threshold (e.g., -2 instead of the default 5 for spa- 
massassin), because the ISP can force its users to send 
email with attestations, in exchange for relaying email 
through its own servers. 

This low spamassassin “required score” threshold (or
sa_fltr_thr in Figure 3) tags most unattested spam
as unwanted. However, in the process, it might also tag
some valid email as spam. In order to correct this mistake,
the verifier “salvages” messages with a high spam
filter score that carry a valid attestation, and relays them;
high-score, unattested email is discarded as spam. This
step ensures that legitimate human-generated email is
forwarded unconditionally, even if the sender’s machine
is compromised. Thus, NAB guarantees that human-generated
email from even a compromised machine is
forwarded correctly (for example, in our trace study in
§6, we did not find a single legitimate email that was
ultimately rejected). Finally, while spam that steals
attestations will also be forwarded, in our trace-based study
this spam volume is 92% less than the spam forwarded
today (§6). This reduction is because the attester limits
the bot to acquiring attestations only when there is human
activity, and even then at a rate limit of at most one
attestation per $\Delta$ (one second for type 0 attestations).
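
The relay decision described above reduces to a few lines. This sketch follows the prose rather than Figure 3 itself; the function name is hypothetical, `sa_fltr_thr` is the threshold named in the text, and we assume the usual spamassassin convention that a message is tagged as spam when its score reaches the required score.

```python
def sender_isp_decision(spam_score: float, has_valid_attestation: bool,
                        sa_fltr_thr: float = -2.0) -> str:
    """Return "relay" or "discard" for one outgoing message."""
    if spam_score < sa_fltr_thr:
        return "relay"      # the aggressive filter did not tag the message
    if has_valid_attestation:
        return "relay"      # tagged, but salvaged by evidence of human activity
    return "discard"        # tagged and unattested: treated as spam
```

With the threshold at -2, an ordinary message scoring 1.5 is relayed only if it carries a valid attestation.
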
Recipient’s inbox. A second form of deploying the
verifier is at the email recipient. This form can coexist
with the verifier on the sender’s side.

We observe that any email server classifying email as
spam or not can ensure that a legitimate email is not misclassified
by improving the spam score for email messages
with attestations by a small number (=3, §6). This
number should be high enough that all legitimate email
is classified correctly, while spam with or without attestations
is still caught.

The verifier improves the score for all attested emails
by 3, thereby vastly improving the delivery of legitimate
email. Additionally, in this deployment, the verifier also
checks that the “To:” or “Cc:” headers contain the recipient’s
email address or the address of a subscribed mailing
list. If not (for example, in the case of “Bcc:”), it does
not improve the spam score by 3 points.
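
The recipient-side adjustment can be sketched with the standard-library email parser. The function name and the constant are hypothetical, and we assume that “improving” a score means lowering it, as in filters where higher scores indicate spam.

```python
from email import message_from_string
from email.utils import getaddresses

ATTESTATION_BONUS = 3.0  # the "=3" adjustment mentioned in the text


def adjusted_score(raw_message: str, base_score: float, attested: bool,
                   recipient: str, subscribed_lists=()) -> float:
    """Lower the spam score for attested mail visibly addressed to the
    recipient or to a mailing list the recipient subscribes to."""
    msg = message_from_string(raw_message)
    fields = msg.get_all("To", []) + msg.get_all("Cc", [])
    visible = {addr.lower() for _, addr in getaddresses(fields)}
    allowed = {recipient.lower()} | {lst.lower() for lst in subscribed_lists}
    if attested and visible & allowed:
        return base_score - ATTESTATION_BONUS
    return base_score
```

A message delivered through “Bcc:” names neither the recipient nor a subscribed list in its visible headers, so its score is left unchanged.
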
Incentives. Email senders have an incentive to deploy
NAB because it prevents their email from being misclassified
as spam. Verifiers can be deployed either for reducing
spam forwarded through mail relays or for ensuring
that all legitimate email is classified and delivered correctly.
Home ISPs, which see a significant number of compromised
hosts on their networks, can benefit from the
first deployment scenario, because, unlike other methods
of content or IP-based filtering, attestations still allow
all legitimate email from compromised hosts, while
reducing spam significantly (§6). Also, web-based mail
servers such as gmail have an incentive to deploy NAB so
that they can avoid being blacklisted by other email relays
by reducing the spam they forward today. Finally,
email recipients have an incentive to deploy NAB because
they will receive all legitimate email correctly, unlike
today (§6).


5.2.2 DDoS policy


We consider scenarios where DDoS is effected by over- 
loading servers, and not by flooding networks. The ver- 
ifier resides in a firewall or load balancer, and observes 
the response time of the web server to determine whether 
the server is overloaded [31]. Here, unlike in spam, the 
verifier does not drop requests with invalid or missing 
attestations. Instead, it prioritizes requests with valid at- 
testations over those that lack them. Prioritizing, rather 
than dropping, makes sense because some valid requests 
may actually be generated automatically by machines 
(for example, automatic page refreshes on news sites like 
cnn.com). 

The verifier processes the web request in the following 
application-specific manner. If the request is for a page 
URL, the verifier treats it as a fresh request. It keeps a set 
of all valid attestations it has seen in the past 10 minutes, 
and adds the attestation and the requested page URL to 
the list. If the request is for an embedded object within a 
page URL, the verifier searches the attestation list to see
if the attestation is present in the list. If the attestation 
is present in the list, and if the requested object belongs 
to the page URL recorded in the list for the attestation, 
the verifier treats the attestation as valid. Otherwise, it 
lowers the priority of the request. The verifier ages the 
stored attestation list every minute. 
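
The attestation list for web requests can be sketched as follows. The class and method names are hypothetical, attestations are keyed by their nonce for simplicity, and `belongs_to` is a caller-supplied predicate for whether an object is embedded in a page, since the text does not specify how the verifier learns this. The sketch ages entries on each lookup rather than once a minute.

```python
class AttestationList:
    """Valid attestations seen recently, each with the page URL it was used for."""

    WINDOW_S = 600  # ten minutes

    def __init__(self):
        self._entries = {}  # nonce -> (page_url, first_seen_s)

    def age(self, now_s: float) -> None:
        """Drop entries older than the window."""
        self._entries = {n: (url, t) for n, (url, t) in self._entries.items()
                         if now_s - t <= self.WINDOW_S}

    def on_page_request(self, nonce: bytes, page_url: str, now_s: float) -> str:
        """A page URL with an already validated attestation is a fresh request."""
        self._entries[nonce] = (page_url, now_s)
        return "high"

    def on_object_request(self, nonce: bytes, object_url: str, now_s: float,
                          belongs_to) -> str:
        """Embedded objects reuse the page's attestation."""
        self.age(now_s)
        entry = self._entries.get(nonce)
        if entry is not None and belongs_to(object_url, entry[0]):
            return "high"
        return "low"  # unknown attestation or unrelated object: lower priority
```

An attestation issued for one page therefore cannot be used to raise the priority of objects that do not belong to that page.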


The priority policy serves all outstanding attested re- 
quests first, and uses any remaining capacity to serve all 
unattested requests in order. 
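
A minimal sketch of this priority policy, assuming a fixed per-round capacity and a hypothetical `serve` helper that takes requests in arrival order:

```python
from collections import deque


def serve(requests, capacity: int):
    """requests: (request_id, attested) pairs in arrival order.
    Returns the ids served this round: attested first, then the rest."""
    requests = list(requests)
    attested = deque(rid for rid, ok in requests if ok)
    unattested = deque(rid for rid, ok in requests if not ok)
    served = []
    for queue in (attested, unattested):
        while queue and len(served) < capacity:
            served.append(queue.popleft())
    return served
```

When capacity is plentiful every request is served, so unattested but legitimate automatic traffic is delayed only under overload.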


Incentives. Overloaded web sites have a natural incen- 
tive to deploy verifiers. While users have an incentive to 
deploy attesters to receive priority treatment, the attester 
deployment barrier can be still high. However, since our 
attester is not application-specific, it is possible for the 
web browser to leverage the attester deployed for email 
or click-fraud. 


5.2.3 Click-fraud policy


Click-fraud occurs whenever an automated request is 
generated for a click, without any interest in the click 
target. For example, a botmaster puts up a web site to 
show ads from companies such as Google, and causes 
his bots to fetch ads served by Google through his web 
site. This action causes Google to pay money to the bot- 
master. Similarly, an ad target’s competitor might gener- 
ate invalid clicks in order to run up ad costs and bankrupt 
the ad purchaser. Further, the competitor might be able to 
purchase ad words for a smaller price, because the victim 
might no longer bid for the same ad word. Finally, com- 
panies like Google have a natural incentive to prove to 
their advertisers that ads displayed together with search 
results are clicked not by bots but by humans. 


With NAB, a verifier such as Google can implement 
the verifier within its web servers, configured as a sim- 
ple policy of not serving unattested requests. Also, it can 
log all attested requests to prove to the advertiser that 
the clicks Google is charging for are, in fact, human- 
generated. 


Incentives. Companies like Google, Yahoo and Mi- 
crosoft that profit from ad revenue have a good incentive 
to deploy verifiers internally. They also have an incen- 
tive to distribute the attester as part of browser toolbars. 
Such toolbars are either factory installed with new PCs, 
or the user can explicitly grant permission to install the 
attester. While the user may not benefit directly in this 
case, she benefits from spam and DDoS reduction, and 
from being made aware of potential problems when a bot 
steals key clicks. 


5.3 Security guarantees


NAB provides two important security guarantees. First, 
it ensures that attestations cannot be double-spent. Sec- 
ond, it ensures that a bot cannot steal key clicks and ac- 
cumulate attestations beyond a fixed time window, which 
reduces the aggregate volume and burstiness of bot traf- 
fic. 

The verifier uses the nonce in the attestation (Figure 2) 
for these two guarantees. The verifier stores the nonces 
for a short period (10 minutes for web requests, one 
month for email). We find this nonce overhead to be 
small in practice (§6.3). If a bot recycles an attestation
after one month, and the spam filter at the verifier flags
the email as spam based on content analysis, the verifier
uses the “Date:” field in the attested email to safely
discard the request because the message is old.
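
The nonce bookkeeping can be sketched as a cache with a retention period. `NonceCache` is a hypothetical name; the retention values in the usage note are those given above (10 minutes for web requests, one month for email).

```python
class NonceCache:
    """Remembers nonces for retention_s seconds to detect double-spending."""

    def __init__(self, retention_s: float):
        self.retention_s = retention_s
        self._seen = {}  # nonce -> time first seen (seconds)

    def check_and_record(self, nonce: bytes, now_s: float) -> bool:
        """Return True if the nonce is fresh; False if it is a replay
        within the retention period."""
        # Expire entries that have outlived the retention period.
        self._seen = {n: t for n, t in self._seen.items()
                      if now_s - t < self.retention_s}
        if nonce in self._seen:
            return False
        self._seen[nonce] = now_s
        return True
```

For example, `NonceCache(600)` would serve web requests and `NonceCache(30 * 24 * 3600)` email. Once an entry expires the cache alone no longer detects the replay, which is why the email policy additionally relies on the “Date:” field.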

The combination of application-specific verifier pol- 
icy and content-bound attestations can also be used to 
mitigate bursty attacks. For example, a web URL can in- 
clude an identifier that encodes the link freshness. Since 
attestations include the identifier, the verifier can discard 
out-of-date requests, even if they have valid signatures. 


6 Evaluation 


In this section, we evaluate NAB’s two main compo- 
nents: a) our current attester prototype with respect to 
metrics such as TCB size, CPU requirements, and appli- 
cation changes; and b) our verifier prototype with respect 
to metrics such as the extent to which it mitigates attack- 
specific traffic such as spam, DDoS and click-fraud, and 
the rate at which it can verify attestations. 

Our main experiments and their conclusions are shown 
in Table 1. We elaborate on each of them in turn. 


6.1 Attester Evaluation 


TCB size. We implemented the attester as a kernel mod- 
ule within Xen. Xen is well-suited because it provides a 
virtual machine environment with sufficient isolation be- 
tween the attester and the untrusted OS. However, the 
chief difficulty was keeping the total TCB size small. 
Striving for a small TCB allows the attester to handle 
untrusted OSes with a higher assurance. While the Xen 
VMM itself is small (about 30 times smaller than the Linux
kernel), we have to factor the size of a privileged do- 
main such as Domain-0 into the TCB code base. Unfor- 
tunately, this increases the size of the TCB to more than 
5 million source lines of code (SLOC), the majority of 
which is device driver code. 

Instead, we started with a minimal kernel that only
includes the necessary drivers for our platform. We included
the Xen VMM and built untrusted guest OSes using
the mini-OS [19] domain building facility included
in the Xen distribution. Mini-OS allows the user-space
applications and libraries of the host VM to be untrusted,
leaving us with a total codebase of around 30,000 source
lines of code (SLOC) for the trusted kernel, VMM and
attester. Our attester was less than 500 SLOC. While this
approach produced a TCB that can be considered reasonably
small, especially compared to the status quo, we
are examining alternatives such as using Xen’s driver domain
facility that allows device drivers to run in unprivileged
domains. We are also working on using the IOMMUs
found on the newer Intel platforms, which enable
drivers for devices other than keyboard and mouse to run
in the untrusted OS, while ensuring that the attester cannot
be corrupted due to malicious DMA requests. Such
an approach makes the attester portable to any x86 platform.


Experiment                         Conclusion

TCB size                           500 source lines of code (SLOC) for attester, 30K SLOC total
Attester CPU cost                  < $10^7$ instructions/attestation
Application changes                < 250 SLOC for simple applications
Worst-case spam mitigation         > 92% spam suppressed; no human-sent email missed
Worst-case DDoS mitigation         > 89% non-human requests identified; no human requests demoted
Worst-case click-fraud mitigation  > 87% automated clicks denied; no human request denied
Verifier throughput                > 1,000 req/s. Scalable to withstand 100,000-bot DDoS

Table 1: Summary of key experiments and their results.

Attester CPU cost. The attester uses RSA signatures 
with a 1024-bit modulus, enabling it to generate and re- 
turn an attestation to the application with a worst-case 
latency of 10 ms on a 2 GHz Core 2 processor. This 
latency is usually negligible for email, ad click, or fetch- 
ing web pages from a server under DDoS. Establishing 
an outgoing TCP connection to a remote server usually 
takes more than this time, and attestation generation is 
interleaved with connection establishment. 

Application changes. We modified two command-line 
email and web programs to request and submit attes- 
tations: NET::SMTP, a Perl-based SMTP client, and 
CURL, an HTTP client written in C. Both modifications 
required changes or additions of less than 250 SLOC. 


6.2 Verifier Evaluation 


We used a trace study of detailed keyboard and mouse 
activity of 328 volunteering users at Intel to confirm 
the mitigation efficacy of our application-specific veri- 
fier policies. We find the following four main benefits 
with our approach: 


1. If the sender’s mail relay or the receiver’s inbox uses
NAB and checks for attestations, the amount of spam
that passes through tuned spam filters (i.e., false negatives)
reduces by more than 92%, while not flagging
any legitimate email as spam (i.e., no false positives).
The spam reduction occurs by setting the “scoring
thresholds” aggressively; the presence of concomitant
human activity greatly reduces the number of legitimate
emails flagged as spam.

2. In addition to reduced spam users see in their inboxes,
NAB also reduces the peak processing load seen at
mail servers, because the amount of attested spam that
can be sent even by an adaptive botnet is bounded by
the number of human clicks that generate attestations.
Hence, mail servers can prioritize attested requests,
potentially dropping low-priority ones, which improves
the fraction of human-generated email processed during
high-load periods.

3. NAB can filter out more than 89% of bot-mounted
DDoS activity without misclassifying human-generated
requests.

4. NAB can identify click-fraud activity generated by adware
with more than 87% accuracy, without losing any
human-generated web clicks.


Methodology. We use the keyboard and mouse click 
traces collected by Giroire et al. [11]; activity was 
recorded on participants’ laptops at one-second granu- 
larity at all times, both at work and at home. Each user’s 
trace is a sequence of records with the following relevant
information: timestamp; number of keyboard clicks
within the last second; number of mouse clicks within the 
last second; the foreground application that is receiving 
these clicks (such as “Firefox’’, “Outlook’’, etc.); and the 
user’s network activity (i.e., the TCP flow records that 
were initiated in the last one second). Nearly 400 users 
participated in the trace study, but we use data from 328 
users because some users left the study early. These 328 
users provide traces continuously over a one-month period
in January-February 2007, as long as their machines
were powered on. While the user population size is mod- 
erate, the users and the workloads were diverse. For ex- 
ample, there were instances of significant input device 
activity corresponding to gaming activity outside regular 
work. So, we believe the traces are sufficiently represen- 
tative of real-world activity. 




Separately, we also collected malware traces from a 
honeypot. The malware whose traces we gathered in- 
cluded: a) the Storm Worm [13], which was until re- 
cently the largest botnet, generating several tens of bil- 
lions of spam messages per day; and b) three adware 
bots called 180solutions, Nbcsearch and eXactSearch, 
which are known to perpetrate click-fraud against Ya- 
hoo/Overture. For spam, we also used a large spam cor- 
pus containing more than 100,000 spam messages and 
50,000 valid messages [1]. Each message in the corpus is 
hand-classified as spam or non-spam, providing us with 
ground-truth. For DDoS, we use traffic traces from the 
Internet Traffic Archive [27], which contain flash-crowd 
scenarios. We assume that these flash crowds represent 
DDoS requests, because, as far as an overloaded server 
is concerned, the two scenarios are indistinguishable. 

We overlay the user activity traces with the malware 
and DDoS traces for each user, and compare the results 
experienced by the user at the output of the verifier with 
and without attestations. We consider two strategies for 
overlaying requests: a normal bot and an adaptive bot. 
The adaptive bot represents the worst-case scenario for 
the verifier, because it monitors human activity and mod- 
ulates its transmissions to collect attestations and mas- 
querade as a user at the verifier. 

We consider an adaptive adversary that buffers its requests
until it sees valid human activity, and simulate the
amount of benefit NAB can provide under such adversarial
workloads.
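
The difference between the two overlay strategies can be pictured with a small simulation. This is a minimal sketch, not the simulator used for the results below: the function name, the per-second trace format, and the rule that each second of human input yields at most one attestation a bot can steal are all assumptions made for illustration.

```python
def attested_bot_requests(human_active, bot_arrivals, adaptive):
    """Count bot requests that manage to obtain an attestation.

    human_active: list of bools, one per second (True = keyboard/mouse input).
    bot_arrivals: list of ints, bot requests generated in each second.
    adaptive:     if True, the bot buffers requests until human activity occurs.

    Assumption (hypothetical policy): each second with human input yields at
    most one attestation that a bot request can piggyback on.
    """
    buffered = 0
    attested = 0
    for active, arrivals in zip(human_active, bot_arrivals):
        if adaptive:
            buffered += arrivals      # hold requests until a human shows up
            pending = buffered
        else:
            pending = arrivals        # a normal bot sends immediately
        if active and pending > 0:
            attested += 1
            if adaptive:
                buffered -= 1
    return attested


human = [False, False, True, False, True, False]
bots = [2, 1, 0, 3, 0, 1]
normal_count = attested_bot_requests(human, bots, adaptive=False)
adaptive_count = attested_bot_requests(human, bots, adaptive=True)
```

In this toy trace the normal bot never coincides with human input, while the adaptive bot obtains one attestation per active second. Under the stated assumption, the adaptive bot's attested traffic is bounded by the amount of human activity rather than by its own request rate.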

Spam mitigation. The verifier can be used in two ways
(§5.2). First, mail relays such as gmail or the SMTP
server at the user’s ISP can require attestations for outgo- 
ing email. In this case, the main benefit comes from fil- 
tering out all unattested spam and catching most attested 
spam, while allowing all legitimate email. So, the main 
metric here is how much attested spam is suppressed. 
Second, the inbox at the receiver can boost the “spam 
score” for all attested email, thereby improving the prob- 
ability that a legitimate email is not misclassified. So, the 
main metric here is how much attested human-generated 
email is misclassified as spam. 

Figure 4 shows the amount of spam, attested or not, 
that managed to sneak through spamassassin’s Bayesian 
filter for a given spam threshold setting. By setting a 
spam threshold of -2 for an incoming message, and admitting
messages that still cleared this threshold and carried
valid attestations, we cut down the amount of spam
forwarded by mail relays by more than 92% compared to 
the amount of spam forwarded currently. 

Figure 4: Missed spam percentage vs. spam threshold
with attestations. By setting spam threshold to -2, spam
cleared by spamassassin and received in inboxes today is
reduced more than 92% even in worst case (i.e., adaptive
bots), without missing any legitimate email.

From our traces, we also found that no attested human-generated
email is misclassified as spam for a spam
threshold setting of 5, as long as the spam score of attested
messages is boosted by 3 points. On the other
hand, spamassassin uses a threshold of 5 by default because,
without attestations, a lot of valid email would be
missed if it were to use a spam score of -2. Even so, about
0.08% of human-generated email is still misclassified as
spam, which is a significant improvement in legitimate
email reception.
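
The threshold-plus-boost policy can be written down in a few lines. The sketch below is illustrative only: the function and parameter names are hypothetical, and it reads "boosted by 3 points" as a 3-point credit in the sender's favor, that is, subtracted from the filter's score before comparison. It also assumes the usual spamassassin convention that a message is flagged when its score reaches the threshold.

```python
def is_flagged_as_spam(score, attested, threshold=5.0, attested_credit=3.0):
    """Return True if the message should be treated as spam.

    score:           spam score from the content filter (higher = more spammy).
    attested:        True if the message carries a valid attestation.
    threshold:       score at or above which a message is flagged.
    attested_credit: points credited to attested messages (assumed reading
                     of the 3-point boost described in the text).
    """
    effective = score - attested_credit if attested else score
    return effective >= threshold


# A borderline legitimate message survives only when it is attested.
borderline_attested = is_flagged_as_spam(6.5, attested=True)
borderline_plain = is_flagged_as_spam(6.5, attested=False)
# An aggressive threshold of -2 flags unattested mail that scores 0.5.
aggressive_plain = is_flagged_as_spam(0.5, attested=False, threshold=-2.0)
aggressive_attested = is_flagged_as_spam(0.5, attested=True, threshold=-2.0)
```

With this structure, lowering the threshold tightens the filter for all mail, while the credit keeps attested human-generated mail on the admitted side.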


There is another benefit that the verifier can derive by 
using attestations. It comes in the form of reduced peak 
load observed while processing spam. Today’s email
servers are taxed by ever-increasing spam requests [23]. 
At peak times, the mail server can prioritize messages 
carrying attestations over those that do not, and process 
the lower-priority messages later. 
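
One simple way to realize this prioritization is a two-level priority queue. The following is a minimal sketch under assumptions of our own: the class name and interface are hypothetical, and only two priority levels (attested and unattested) are modeled, with arrival order preserved inside each level.

```python
import heapq
import itertools


class AttestationPriorityQueue:
    """Serve attested messages before unattested ones, FIFO within a level."""

    HIGH, LOW = 0, 1

    def __init__(self):
        self._heap = []
        self._seq = itertools.count()  # tie-breaker that preserves arrival order

    def push(self, message, attested):
        priority = self.HIGH if attested else self.LOW
        heapq.heappush(self._heap, (priority, next(self._seq), message))

    def pop(self):
        # Smallest (priority, sequence) tuple comes out first.
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)


queue = AttestationPriorityQueue()
queue.push("a", attested=False)
queue.push("b", attested=True)
queue.push("c", attested=False)
queue.push("d", attested=True)
service_order = [queue.pop() for _ in range(len(queue))]
```

A server under load could stop popping once only low-priority entries remain and resume later, which matches the deferral described above.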


Figure 5 shows the CDF of the percentage of spam re- 
quests that the verifier must still service at a high priority 
because of stolen attestations. NAB demotes spam traffic 
without attestations by more than 91% in the worst case 
(equivalently, less than 7.5% of spam traffic is served at 
the high priority). At the same time, no human-generated 
requests are demoted. The mean of the admitted spam 
traffic is 2.7%, and the standard deviation is 1.3%. Thus, 
NAB reduces peak server load by more than 10x. 


DDoS mitigation. The verifier uses the DDoS policy 
described in §5.2, by giving lower priority to requests
without attestations. Figure 6 shows the CDF of the per- 
centage of DDoS requests that the verifier still serves at 
a high priority because of stolen attestations. NAB de- 
motes DDoS traffic by more than 89% in the worst case 
(equivalently, only 11% of DDoS traffic is served at the 
high priority). At the same time, no human-generated 
requests are demoted. The mean of the admitted DDoS 
traffic is 5.8%, and the standard deviation is 2.2%. 

Figure 5: CDF of percentage of bots’ spam requests serviced
by an email server in the worst case. The mail
server’s peak spam processing load is reduced to less
than 7.5% of its current levels.

Figure 6: CDF of percentage of bots’ DDoS requests
serviced in the worst case. Allowed DDoS traffic is restricted
to less than 11% of original levels.

Click-fraud mitigation. The verifier uses the click-fraud
policy described in §5.2. Figure 7 shows the amount
of click-fraud requests that the verifier satisfies due to
valid attestations. NAB denies more than 87% of all
click-fraud requests in the worst case (equivalently, only
13% of all click-fraud requests are serviced). At the same
time, no human-generated requests are denied service.
The mean of the serviced click-fraud traffic is 7.1%, and
the standard deviation is 3.1%.


6.3 Verifier Throughput


Figure 7: CDF of percentage of bots’ click-fraud requests
serviced in the worst case. Serviced click-fraud
requests are restricted to less than 13% of original levels.

The verifier processes attestations, which are signed RSA
messages, at a rate of more than 10,000 attestations per
second on a 2 GHz Core 2 processor. It benefits from
the fact that RSA verification is several times faster than
signing. The verifier processes an attestation by consulting
the database of previously seen nonces within an
application-specific period. The longest is email, with
a duration of one month, while nonces of web requests
are stored for 10 minutes, and fit in main memory. Even
in the worst-case scenario of a verifier at an ISP’s busy
SMTP relay, the storage and lookup costs for the nonces
are modest: for a server serving a million clients, each
of which sends a thousand emails per day, the nonce storage
overhead is around 600 GB, which can fit on a single
disk and incur one lookup overhead. This overhead is
modest compared to the processing and storage costs incurred
for reliable email delivery.
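
The replay check and the storage estimate can both be sketched briefly. Everything named here is hypothetical: the in-memory dictionary stands in for the nonce database, and the 30-day month and 20-byte nonce size are assumptions chosen because together they reproduce the 600 GB figure quoted above.

```python
import time


class NonceCache:
    """Reject attestations whose nonce was already seen within a retention period."""

    def __init__(self, retention_seconds, clock=time.time):
        self._retention = retention_seconds
        self._clock = clock
        self._seen = {}  # nonce -> time it was first recorded

    def check_and_record(self, nonce):
        """Return True if the nonce is fresh, False if it is a replay."""
        now = self._clock()
        first_seen = self._seen.get(nonce)
        if first_seen is not None and now - first_seen < self._retention:
            return False
        self._seen[nonce] = now
        return True


def nonce_storage_bytes(clients, requests_per_day, retention_days, bytes_per_nonce):
    """Worst-case bytes needed to remember every nonce in the retention window."""
    return clients * requests_per_day * retention_days * bytes_per_nonce


# Web requests: 10-minute retention, driven here by a fake clock.
fake_now = [0.0]
web_cache = NonceCache(retention_seconds=600, clock=lambda: fake_now[0])
first = web_cache.check_and_record(b"nonce-1")
replay = web_cache.check_and_record(b"nonce-1")
fake_now[0] = 601.0
after_expiry = web_cache.check_and_record(b"nonce-1")

# Email: a million clients, a thousand emails per day, one month of nonces.
email_bytes = nonce_storage_bytes(10**6, 1000, 30, 20)
```

Under these assumptions the email case stores 3 x 10^10 nonces, which at 20 bytes each is 600 GB; a production verifier would also need to evict expired entries, which this sketch omits.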

Another concern is that the verifier is itself susceptible 
to a DDoS attack. To understand how well our verifier 
can withstand DDoS attacks, we ran experiments on a 
cluster of 10 Emulab machines configured as distributed 
email verifiers. We launched a DDoS from bots with 
fake attestations. Each DDoS bot sent 1 req/s to one of 
the ten verifiers at random, in order to mimic the behav- 
ior of distributed low-rate bots forming a DDoS botnet. 
Our goal was to determine whether a botnet of 100,000 
nodes (which is comparable to the median botnet size) 
can overwhelm this verifier infrastructure or not. Our 
bot implementation used 100 clients to simulate 1000 
bots each, and attack the ten verifier machines. We as- 
sume network bandwidth is not a bottleneck, and that the 
bots are targeting the potential verification overhead bottleneck.
A verifier queues incoming requests until it can
attend to them, and has sufficient request buffers.

Figure 8 shows the latency increase (in ms) experi- 
enced by a normal client request. Normally, a user takes 
about 1 ms to get her attestation verified. With DDoS, we 
find that even a 100,000-node botnet degrades the perfor- 
mance of a normal request only by an additional 1.2 ms 
at most. Hence, normal request processing is not affected 
significantly. Thus, a cluster of 10 verifiers can withstand 
a 100,000-node botnet using fake attestations. 
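
A back-of-the-envelope check of the offered load in this experiment helps put the result in context. The helper below is a hypothetical illustration that uses only the numbers stated above; it says nothing about how cheaply a fake attestation can be rejected, which the text does not specify.

```python
def per_verifier_load(bots, requests_per_bot_per_second, verifiers):
    """Average request rate seen by one verifier when bots pick targets uniformly."""
    return bots * requests_per_bot_per_second / verifiers


simulated_bots = 100 * 1000  # 100 client machines, each simulating 1000 bots
load = per_verifier_load(simulated_bots, 1, 10)  # requests per second per verifier
```

The result, 10,000 requests per second per machine, is of the same order as the per-machine verification rate reported at the start of this section.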


7 Related Work 


We classify prior work into three main categories.
Figure 8: Request processing latency at the verifier.


Human activity detection. CAPTCHAs [30? ] are 
the currently popular mechanism for proving human 
presence to remote verifiers. However, as described 
in §4, they suffer from four major drawbacks that render
them less attractive for mitigating botnet attacks.
First, CAPTCHAs as they are used today are transfer- 
able and not bound to the content they attest, and are 
hence vulnerable to man-in-the-middle attacks, although 
one could imagine designs to address this shortcoming;
second, they are semantically independent of the application
(i.e., unbound to the user’s intent), and are hence
exposed to human solver attacks; third, they are obtrusive,
which restricts their use for fine-grained attestations
(by definition, CAPTCHAs require manual human
input), and hence cannot be automated, unlike NAB.
Also, we are witnessing continued successes in breaking 
the CAPTCHA implementations of several sites such as 
Google, Yahoo, and MSN [12], leading some to question 
even their long-term viability [34], at least in their cur- 
rent form. By contrast, NAB’s security relies on cryp- 
tographic protocols such as RSA that have been studied 
and used longer. 


The recent work on the Nexus operating system [33] 
has developed support for application properties to be securely
expressed using a trusted reference monitor mechanism.
The Nexus reference monitor is more expressive
than a TPM implementing a hash-based trusted boot. So, 
it allows policies restricting outgoing email only from 
registered email applications. In contrast, we assume 
commodity untrusted OS and applications. 


The approach of using hardware to enable human ac- 
tivity detection has been described before in the context 
of on-line games, using untrusted hardware manageabil- 
ity engines (such as Intel’s AMT features) [21]. 

Mitigating spam, DDoS and click-fraud. There is
extensive literature related to mitigation techniques for
spam [2], DDoS [20, 35] and click-fraud [26]. There
are still no satisfactory solutions, so application-specific
defenses are continuously proposed. For example, Occam
[10], SPF (Sender Policy Framework), DKIM (DomainKeys
Identified Mail) and “bonded sender” [6]
have been put forth recently as enhancements. Simi- 
larly, DDoS and click-fraud mitigation have each seen 
several radically different attack-specific proposals re- 
cently. These proposals include using bandwidth-as- 
payment [31], path validation [35], and computational 
proofs of work [20] for DDoS; and using syndicators, 
premium clicks, and clickable CAPTCHAs for click- 
fraud [26]. 

While all these proposals certainly have several merits,
we propose that it is possible to mitigate a variety
of botnet attacks using a uniform mechanism such
as NAB’s attestation-based human activity verification. 
Such a uniform attack mitigation mechanism amortizes 
its cost of deployment. Moreover, unlike some propos- 
als, NAB does not rely on IP-address blacklisting, which 
is unlikely to work well because even legitimate requests 
from a blacklisted host are denied. Also, NAB can be im- 
plemented purely at the end hosts, and does not require 
Internet infrastructure modification.

Secure execution environments. The TPM specifications
[28] defined by the Trusted Computing Group are
aimed at providing primitives that can be used to pro- 
vide security guarantees to commodity OSes. TPM-like 
services have been extended to OSes that cannot have 
exclusive access to a physical TPM device of their own, 
as with legacy and virtual machines. For example, Pio- 
neer [22] provides an externally verifiable code execution 
environment for legacy devices similar to that provided 
by a hardware TPM, and vTPM [5] provides full TPM
services to multiple virtualized OSes. NAB assumes a
single OS and a hardware TPM, but can leverage this research
in the future.

XOM [15] and Flicker [16] provide trusted execu- 
tion support even when physical devices such as DMA 
or, with XOM, even main memory are corrupted, while 
SpyProxy [18] blocks suspicious web content by exe- 
cuting the content in a virtual machine first. In con- 
trast, NAB assumes compromised machines’ hardware 
is functioning correctly, that the bot may generate di- 
verse traffic such as spam and DDoS, and that owners do 
not mount hardware attacks against their own machines, 
which is realistic for botted machines. 


8 Conclusions 


This paper presented NAB, a system for mitigating net- 
work attacks by using automatically obtained evidence of 
human activity. NAB uses a simple mechanism centered 
around TPM-backed attestations of keyboard and mouse 
clicks. Such attestations are responder- and content-specific,
and certify human activity even in the absence
of globally unique identities. Application-specific verifiers
use these attestations to implement various policies.
Our implementation shows that it is feasible to provide
such attestations at low TCB size and runtime cost.
By evaluating NAB using trace analysis, we estimate 
that NAB can reduce the amount of spam evading tuned 
spam filters by more than 92% even with worst-case ad- 
versarial bots, while ensuring that no legitimate email 
is misclassified as spam. We realize similar benefits 
for DDoS and click-fraud. Our results suggest that the 
application-independent abstraction provided by NAB 
enables a range of verifier policies for applications that 
would like to separate human-generated requests from 
bot traffic.

Notes


¹The TPM terminology uses the term register extension to imply
appending a new value to the hash chain maintained by that register.

Abstract


Network security applications often require analyzing 
huge volumes of data to identify abnormal patterns or 
activities. The emergence of cloud-computing models 
opens up new opportunities to address this challenge by 
leveraging the power of parallel computing. 

In this paper, we design and implement a novel sys- 
tem called BotGraph to detect a new type of botnet spam- 
ming attacks targeting major Web email providers. Bot- 
Graph uncovers the correlations among botnet activities 
by constructing large user-user graphs and looking for 
tightly connected subgraph components. This enables us 
to identify stealthy botnet users that are hard to detect 
when viewed in isolation. To deal with the huge data 
volume, we implement BotGraph as a distributed appli- 
cation on a computer cluster, and explore a number of 
performance optimization techniques. Applying it to two 
months of Hotmail log containing over 500 million users, 
BotGraph successfully identified over 26 million botnet- 
created user accounts with a low false positive rate. The 
running time of constructing and analyzing a 220GB Hot- 
mail log is around 1.5 hours with 240 machines. We be- 
lieve both our graph-based approach and our implemen- 
tations are generally applicable to a wide class of security 
applications for analyzing large datasets. 


1 Introduction 


Despite a significant breadth of research into botnet de- 
tection and defense (e.g., [8, 9]), botnet attacks remain 
a serious problem in the Internet today and the phenomenon
is evolving rapidly ([4, 5, 9, 20]): attackers constantly
craft new types of attacks with an increased level
of sophistication to hide the identity of each individual bot.

One recent such attack is the Web-account abuse at- 
tack [25]. Its large scale and severe impact have re- 
peatedly caught public media’s attention. In this attack, 
spammers use botnet hosts to sign up millions of user ac- 
counts (denoted as bot-users or bot-accounts) from major 
free Web email service providers such as AOL, Gmail, 
Hotmail, and Yahoo!Email. The numerous abused bot- 
accounts were used to send out billions of spam emails 
across the world. 

Existing detection and defense mechanisms are ineffective
against this new attack: The widely used mail
server reputation-based approach is not applicable because
bot-users send spam emails through only legitimate
Web email providers. Furthermore, it is difficult to differentiate
a bot-user from a legitimate user individually, as
both users may share a common computer and each
bot-user sends only a few spam emails¹.

*The work was done while Yao was an intern at Microsoft Research
Silicon Valley.

While detecting bot-users individually is difficult, detecting
them as an aggregate holds promise. The rationale
is that since bot-users are often configured similarly
and controlled by a small number of botnet commanders,
they tend to share common features and correlate with each
other in their behavior such as active time, spam contents,
or email sending strategies [24, 27]. Although this
approach is appealing, realizing it to enable detection at 
a large scale has two key challenges: 


• The first is the algorithmic challenge in finding subtle
correlations among bot-user activities and distinguishing
them from normal user behavior.

• The second challenge is how to efficiently analyze
a large volume of data to unveil the correlations
among hundreds of millions of users. This requires
processing hundreds of gigabytes or terabytes of
user activity logs.


Recent advances in distributed programming
models, such as MapReduce [6], Hadoop [2], and
Dryad/DryadLINQ [10, 29], have made programming and
computation on a large distributed cluster much easier.
This provides us with opportunities to leverage the paral- 
lel computing power to process data in a scalable fashion. 
However, there still exist many system design and imple- 
mentation choices. 

In this paper, we design and implement a system called 
BotGraph to detect the Web-account abuse attack at a 
large scale. We make two important contributions. 

Our first contribution is to propose a novel graph- 
based approach to detect the new Web-account abuse at- 
tack. This approach exposes the underlying correlations 
among user-login activities by constructing a large user- 
user graph. Our approach is based on the observation that 
bot-users share IP addresses when they log in and send 
emails. BotGraph detects the abnormal sharing of IP ad- 
dresses among bot-users by leveraging the random graph 
theory. Applying BotGraph to two months of Hotmail
logs totaling 450GB of data, BotGraph successfully identified
over 26 million bot-accounts with a low false positive rate
of 0.44%. To our knowledge, we are the first to provide a
systematic solution that can successfully detect this new
large-scale attack.

¹Recent anecdotal evidence suggests that bot-users have also been
programmed to receive emails and read them to make them look more
legitimate.

Our second contribution is an efficient implementa- 
tion using the new distributed programming models for 
constructing and analyzing large graphs. In our applica- 
tion, the graph to construct involves tens of millions of 
nodes and hundreds of billions of edges. It is challeng- 
ing to efficiently construct such large graphs on a com- 
puter cluster as the task requires computing pair-wise cor- 
relations between any two users. We present two graph 
construction methods using different execution plans: the 
simpler one is based on the MapReduce model [6], and 
the other performs selective filtering that requires the 
Join operation provided by Map-Reduce-Merge [28] or 
DryadLINQ [29]. By further exploring several perfor- 
mance optimization strategies, our implementation can 
process a one-month dataset (220GB-240GB) to con- 
struct a large graph with tens of millions of nodes in 1.5 
hours using a 240-machine cluster. The ability to effi- 
ciently compute large graphs is critical to perform con- 
stant monitoring of user-user graphs for detecting attacks 
at their earliest stage. 

Our ultimate goal, however, is not to just tackle this 
specific new form of attacks, but also to provide a general 
framework that can be adapted to other attack scenarios. 
To this end, the adoption of a graph representation can 
potentially enable us to model the correlations of a wide 
class of botnet attacks using various features. Further- 
more, since graphs are powerful representations in many 
tasks such as social network analysis and Web graph min- 
ing, we hope our large-scale implementations can serve 
as an example to benefit a wide class of applications for 
efficiently constructing and analyzing large graphs. 

The rest of the paper is organized as follows. We discuss
related work in Section 2, and give an overview of the BotGraph
system in Section 3. We then describe in Section 4
the detailed algorithms to construct and analyze a
large user-user graph for attack detection. We present 
the system implementation and performance evaluation 
in Section 5, followed by attack detection results in Sec- 
tion 6. Finally, we discuss attacker countermeasures and 
system generalizations in Section 7. 


2 Background and Related Work 


In this section, we first describe the new attack we focus 
on in our study, and review related work in botnet detec- 
tion and defense. As we use Dryad/DryadLINQ as our 
programming model for analyzing large datasets, we also 
discuss existing approaches for parallel computation on 
computer clusters, particularly those related to the recent
cloud computing systems. 


2.1 Spamming Botnets and Their Detection 


The recent Web-account abuse attack was first reported 
in summer 2007 [25], in which millions of botnet email 
accounts were created from major Web email service 
providers in a short duration for sending spam emails.
While each user is required to solve a CAPTCHA test
to create an account, attackers have found ways to bypass
CAPTCHAs, for example, redirecting them to either
spammer-controlled Web sites or dedicated cheap
labor². The solutions are sent back to the bot hosts
for completing the automated account creation.
Trojan.Spammer.HotLan is a typical worm for such automated
account signup [25]. Today, this attack is one of
the major types of large-scale botnet attacks, and many 
large Web email service providers, such as Hotmail, Yahoo!Mail,
and Gmail, are popular attack targets. To
the best of our knowledge, BotGraph is one of the first solutions
to combat this new attack.

The Web-account abuse attack is certainly not the first
type of botnet spamming attack. Botnets have frequently
been used as a medium for setting up spam email servers.
For example, a backdoor rootkit Spam-Mailbot.c can 
be used to control the compromised bots to send spam 
emails. Storm botnet, one of the most widespread P2P 
botnets with millions of hosts, at its peak, was deemed re- 
sponsible for generating 99% of all spam messages seen 
by a large service provider [9, 19]. 

Although our work primarily focuses on detecting the 
Web-account abuse attack, it can potentially be general- 
ized to detect other botnet spamming attacks. In this gen- 
eral problem space, a number of previous studies have 
all provided us with insights and valuable understanding 
towards the different characteristics of botnet spamming 
activities [1, 11, 23, 26]. Among recent work on detecting
botnet membership [20, 22, 24, 27], SpamTracker [24] 
and AutoRE [27] also aim at identifying correlated spamming
activities and are more closely related to our
work. In addition to exploiting common features of bot- 
net attacks as SpamTracker and AutoRE do, BotGraph 
also leverages the connectivity structures of the user-user 
relationship graph and explores these structures for bot- 
net account detection. 


2.2 Distributed and Parallel Computing 


There have been decades of research on distributed and
parallel computing. Massively parallel processing (MPP)
develops special computer systems for parallel computing
[15]. Projects such as MPI (Message Passing Interface)
[14] and PVM (Parallel Virtual Machine) [21] develop
software libraries to support parallel computing.
Distributed databases are another large category of parallel
data processing applications [17].

The emergence of cloud computing models, such as
MapReduce [6], Hadoop [2], and Dryad/DryadLINQ [10,
29], has enabled us to write simple programs for efficiently
analyzing a vast amount of data on a computer
cluster. All of them adopt the notion of staged computation,
which makes scheduling, load balancing, and failure
recovery automatic. This opens up a plethora of opportunities
for re-thinking network security, an application
that often requires processing huge volumes of logs or
trace data. Our work is one of the early attempts in this
direction.

²Interestingly, solving CAPTCHAs has ended up being a low-wage
industry [3].

While all of these recent parallel computing models of- 
fer scalability to distributed applications, they differ in 
programming interfaces and the built-in operation prim- 
itives. In particular, MapReduce and Hadoop provide 
two simple functions, Map and Reduce, to facilitate data 
partitioning and aggregation. This abstraction enables 
applications to run computation on multiple data partitions
in parallel, but makes it difficult to support other common
data operations such as database Join. To overcome
this shortcoming, Map-Reduce-Merge [28] introduces a 
Merge phase to facilitate the joining of multiple hetero- 
geneous datasets. More recent scripting languages, such 
as Pig Latin [16] and Sawzall [18], wrap the low level 
MapReduce procedures and provide high-level SQL-like 
query interfaces. Microsoft Dryad/DryadLINQ [10, 29] 
offers further flexibility. It allows a programmer to write 
a simple C# and LINQ program to realize a large class of 
computation that can be represented as a DAG. 
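
To make the Map and Reduce abstraction concrete, the sketch below runs both phases in memory on a handful of login records. It is an illustrative stand-in, not the Hadoop or DryadLINQ API: the `map_reduce` helper, the record layout, and the example addresses (taken from documentation-reserved ranges) are all assumptions.

```python
from collections import defaultdict


def map_reduce(records, mapper, reducer):
    """Single-process stand-in for a MapReduce job.

    mapper(record) yields (key, value) pairs; all values for the same key are
    grouped (the "shuffle") and handed to reducer(key, values).
    """
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return {key: reducer(key, values) for key, values in groups.items()}


def emit_ip_user(login):
    # Map: key each login record by the IP address it came from.
    user, ip = login
    yield ip, user


def count_distinct_users(ip, users):
    # Reduce: how many different users logged in from this IP address?
    return len(set(users))


logins = [
    ("alice", "192.0.2.1"),
    ("bob", "192.0.2.1"),
    ("alice", "192.0.2.1"),
    ("carol", "198.51.100.7"),
]
users_per_ip = map_reduce(logins, emit_ip_user, count_distinct_users)
```

Each key is reduced independently, which is what lets a real system spread the work over many partitions. Combining two differently keyed datasets, as a Join does, does not fit this single-input shape as naturally.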

Among these choices, we implemented BotGraph us- 
ing Dryad/DryadLINQ, but we also consider our process- 
ing flow design using the more widely used MapReduce 
model and compare the pros and cons. In contrast to 
many other data-centric applications such as sorting and 
histogram computation, it is much more challenging to 
decompose graph construction for parallel computation 
in an efficient manner. In this space, BotGraph serves 
as an example system to achieve this goal using the new 
distributed computing paradigm. 


3 BotGraph System Overview 


Our goal is to capture spamming email accounts used by 
botnets. As shown in Figure 1, BotGraph has two com- 
ponents: aggressive sign-up detection and stealthy bot- 
user detection. Since service providers such as Hotmail 
limit the number of emails an account can send in one 
day, a spammer would try to sign up as many accounts 
as possible. So the first step of BotGraph is to detect ag- 
gressive signups. The purpose is to limit the total number 
of accounts owned by a spammer. As a second step, Bot- 
Graph detects the remaining stealthy bot-users based on 
their login activities. With the total number of accounts
limited by the first step, spammers have to reuse their ac- 
counts, resulting in correlations among account logins. 
Therefore BotGraph utilizes a graph based approach to 
identify such correlations. Next, we discuss each compo- 
nent in detail. 


3.1 Detection of Aggressive Signups 


Figure 1: The Architecture of BotGraph.

Our aggressive signup detection is based on the premise
that signup events happen infrequently at a single IP address.
Even for a proxy, the number of users signed up
from it should be roughly consistent over time. A sudden
increase of signup activities is suspicious, indicating
that the IP address may be associated with a bot. We use
a simple EWMA (Exponentially Weighted Moving Average)
[13] algorithm to detect sudden changes in signup
activities. This method can effectively detect over 20 million
bot-users in 2 months (see Appendix A for more details
on EWMA). We can then apply adaptive throttling to
rate limit account-signup activities from the corresponding
suspicious IP addresses.
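
A generic EWMA change detector gives a feel for how such a check might look. This is a minimal sketch and not the algorithm of Appendix A: the smoothing factor `alpha`, the absolute jump threshold `min_jump`, and the use of daily signup counts per IP address are all assumed values chosen for illustration.

```python
def ewma_anomalies(daily_signups, alpha=0.5, min_jump=10.0):
    """Return the indices of days whose signup count jumps above the EWMA forecast.

    daily_signups: signup counts for one IP address, one entry per day.
    alpha:         smoothing factor in (0, 1]; larger values track recent days
                   more closely (assumed value).
    min_jump:      how far above the forecast a day must be to be flagged
                   (assumed value).
    """
    if not daily_signups:
        return []
    anomalies = []
    forecast = float(daily_signups[0])
    for day, observed in enumerate(daily_signups[1:], start=1):
        if observed - forecast > min_jump:
            anomalies.append(day)
        # Standard EWMA update: blend the new observation into the forecast.
        forecast = alpha * observed + (1 - alpha) * forecast
    return anomalies


steady_proxy = ewma_anomalies([20, 22, 19, 21, 20])
sudden_burst = ewma_anomalies([2, 3, 2, 40, 3])
```

A proxy with a consistently high signup rate raises no alarm here, while the burst on day 3 of the second series is flagged. This matches the premise that it is the sudden increase, not the absolute level, that is suspicious.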

One might think that spammers can gradually build up
an aggressive signup history for an IP address to evade
EWMA-based detection. In practice, building such a history
requires a spammer to have full control of the IP
address for a long duration, which is usually infeasible
as end-users control the online/offline switch patterns of
their (compromised) computers. The other way to evade
EWMA-based detection is to be stealthy. In the next section
we will introduce a graph based approach to detect
stealthy bot-users.


3.2 Detection of Stealthy Bot-accounts 


Our second component detects the remaining stealthy 
bot-accounts. As a spammer usually controls a set of bot-users,
defined as a bot-user group, these bot-users work
in a collaborative way. They may share similar login or 
email sending patterns because bot-masters often manage 
all their bot-users using unified toolkits. We leverage the 
similarity of bot-user behavior to build a user-user graph. 
In this graph, each vertex is a user. The weight for an 
edge between two vertices is determined by the features 
we use to measure the similarity between the two vertices 
(users). By selecting the appropriate features for similar- 
ity measurement, a bot-user group will reveal itself as a 
connected component in the graph. 
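
The idea that a bot-user group surfaces as a connected component can be sketched with a union-find pass over the weighted edges. The sketch is illustrative only: the function name, the fixed `min_weight` cut-off, and the toy edge list are assumptions, and the full detection algorithm is described later in the paper.

```python
def bot_user_groups(weighted_edges, min_weight):
    """Group users connected by edges whose weight is at least min_weight.

    weighted_edges: iterable of (user_a, user_b, weight) tuples.
    min_weight:     similarity cut-off (an assumed parameter).
    Returns a list of sets, one per connected component with two or more users.
    """
    parent = {}

    def find(user):
        parent.setdefault(user, user)
        while parent[user] != user:
            parent[user] = parent[parent[user]]  # path halving
            user = parent[user]
        return user

    for user_a, user_b, weight in weighted_edges:
        if weight >= min_weight:
            root_a, root_b = find(user_a), find(user_b)
            if root_a != root_b:
                parent[root_a] = root_b

    groups = {}
    for user in list(parent):
        groups.setdefault(find(user), set()).add(user)
    return [members for members in groups.values() if len(members) > 1]


edges = [("u1", "u2", 3), ("u2", "u3", 2), ("u4", "u5", 1), ("u6", "u7", 2)]
groups = bot_user_groups(edges, min_weight=2)
sorted_groups = sorted(sorted(members) for members in groups)
```

Users linked only by weak edges, such as u4 and u5 above, never join a group, so the choice of similarity feature and cut-off determines which accounts end up in the same component.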

In BotGraph, we use the number of common IP addresses
from which two users have logged in as our similarity
feature (i.e., edge weight). This is because the aggressive
account-signup detection limits the number of
bot-accounts a spammer may obtain. In order to achieve a
large spam-email throughput, each bot-account will log
in and send emails multiple times at different locations,
resulting in the sharing of IP addresses as explained
below:


• The sharing of one IP address: For each spammer,
the number of bot-users is typically much larger than
the number of bots. Our data analysis shows that on
each day, the average number of bot-users is about
50 times more than the number of bots. So multiple
bot-users must log in from a common bot, resulting
in the sharing of a common IP address.


• The sharing of multiple IP addresses: We found
that botnets may have a high churn rate. A bot 
may be quarantined and leave the botnet, and new 
bots may be added. An active bot may go offline 
and it is hard to predict when it will come back on- 
line. To maximize the bot-account utilization, each 
account needs to be assigned to different bots over 
time. Thus a group of bot-accounts will also share 
multiple IP addresses with a high probability. 


Our BotGraph system leverages the two aforemen- 
tioned IP sharing patterns to detect bot-user activities. 

Note that with dynamic IP addresses and proxies, normal
users may share IP addresses too. To exclude such
cases, multiple shared IP addresses in the same
Autonomous System (AS) are only counted as one shared
IP address. In the rest of this paper, we use the number of
“shared IP addresses” to denote the number of ASes
of the shared IP addresses. It is very rare to have a group
of normal users that always coincidentally use the same
set of IP addresses across different domains. Using the
AS-number metric, a legitimate user on a compromised
bot will not be mistakenly classified as a bot-user because
their number of “shared IPs” will be only one³.
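The AS-level edge weight described above can be sketched as follows. This is a minimal illustration, assuming login records are available as (user, ip) pairs and that a lookup table ip_to_as (a hypothetical input) maps each IP address to its AS number; it ignores the time dimension of logins.

```python
from collections import defaultdict
from itertools import combinations


def edge_weights(logins, ip_to_as):
    """Return {(u, v): number of distinct ASes in which u and v shared an IP}.

    logins: iterable of (user, ip) pairs.
    ip_to_as: dict mapping an IP address to its AS number (hypothetical lookup).
    """
    users_by_ip = defaultdict(set)
    for user, ip in logins:
        users_by_ip[ip].add(user)

    shared_ases = defaultdict(set)
    for ip, users in users_by_ip.items():
        # Every pair of users seen on this IP shares the IP's AS.
        for u, v in combinations(sorted(users), 2):
            shared_ases[(u, v)].add(ip_to_as[ip])

    # Several shared IPs inside one AS count only once.
    return {pair: len(ases) for pair, ases in shared_ases.items()}


logins = [("a", "ip1"), ("b", "ip1"), ("a", "ip2"), ("b", "ip2"),
          ("a", "ip3"), ("b", "ip3"), ("c", "ip1")]
ip_to_as = {"ip1": 100, "ip2": 100, "ip3": 200}
print(edge_weights(logins, ip_to_as))
```

Here users a and b share three IP addresses but only two ASes, so their edge weight is 2, while c shares a single AS with each of them.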


4 Graph-Based Bot-User Detection 


In this section we introduce random graph models to 
analyze the user-user graph. We show that bot-user 
groups differentiate themselves from normal user groups 
by forming giant components in the graph. Based on the 
model, we design a hierarchical algorithm to extract such 
components formed by bot-users. Our overall algorithm 
consists of two stages: 1) constructing a large user-user 
graph, 2) analyzing the constructed graph to identify bot- 
user groups. Note one philosophy we use is to analyze 
group properties instead of single account properties. For 
example, it may be difficult to use email-sending statistics 
for individual bot-account detection (each bot account 
may send a few emails only), but it is very effective to 
use the group statistics to estimate how likely a group 
of accounts are bot-accounts (e.g., they all sent a similar 
number of emails). 


4.1 Modeling the User-User Graph 


The user-user graph formed by bot-users is drastically
different from the graph formed by normal users:
bot-users have a higher chance of sharing IP addresses and
are thus more tightly connected in the graph. Specifically,
we observed that the bot-user subgraph contains a giant
connected component (a group of connected vertices that
occupies a significant portion of the subgraph), while
the normal-user subgraph contains only isolated vertices
and/or very small connected components. We introduce
random graph theory to interpret this phenomenon
and to model the giant connected components formed by
bot-users. The theory also serves as a guideline for
designing our graph-based bot-user detection algorithm.

³ We assume the majority of hosts are physically located in only one AS.
We discuss how to prune legitimate mobile users in Section 4.2.2.


4.1.1 Giant Component in User-User Graph 


Let us first consider the following three typical strategies 
used by spammers for assigning bot-accounts to bots, and 
examine the corresponding user-user graphs. 


• Bot-user accounts are randomly assigned to bots.
Obviously, all the bot-user pairs have the same probability
p to be connected by an edge.


• The spammer keeps a queue of bot-users (i.e., the
spammer maintains all the bot-users in a predefined
order). The bots come online in a random order. Upon
request from a bot when it comes online, the spammer
assigns to the requesting bot the top k available
(currently not used) bot-users in the queue. To be stealthy,
a bot makes only one request for k bot-users each day.


• The third case is similar to the second case, except that
there is no limit on the number of bot-users a bot can
request for one day and that k = 1. Specifically, a
bot requests one bot-account each time, and it asks for
another account after finishing sending enough spam
emails using the current account.


We simulate the above typical spamming strategies and 
construct the corresponding user-user graph. In the simu- 
lation, we have 10,000 spamming accounts (n = 10, 000) 
and 500 bots in the botnet. We assume all the bots are ac- 
tive for 10 days and the bots do not change IP addresses. 
In model 2, we pick k = 20. In model 3, we assume the 
bots go online with a Poisson arrival distribution and the 
length of bot online time fits a exponential distribution. 
We run each simulation setup 10 times and present the 
average results. 
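A scaled-down sketch of the first strategy (random assignment) is given below. It is not the simulator used in the paper: it assumes that each account logs in from exactly one uniformly random bot per day, uses far fewer accounts and bots, and reports the size of the largest connected component for each edge-weight threshold T. All function names and default values are illustrative.

```python
import random
from collections import defaultdict
from itertools import combinations


def largest_component_size(n, edges):
    """Size of the largest connected component among vertices 0..n-1 (iterative DFS)."""
    adj = defaultdict(list)
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    seen, best = set(), 0
    for start in range(n):
        if start in seen:
            continue
        seen.add(start)
        stack, size = [start], 0
        while stack:
            node = stack.pop()
            size += 1
            for nxt in adj[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        best = max(best, size)
    return best


def simulate_random_assignment(n_accounts=300, n_bots=20, days=10, seed=0):
    """Model 1 sketch: each day every account logs in from one random bot.

    Returns {(u, v): number of distinct bots shared by accounts u and v}.
    """
    rng = random.Random(seed)
    bots_of = [set() for _ in range(n_accounts)]
    for _ in range(days):
        for account in range(n_accounts):
            bots_of[account].add(rng.randrange(n_bots))
    users_at = defaultdict(list)
    for account, bots in enumerate(bots_of):
        for bot in bots:
            users_at[bot].append(account)  # appended in increasing account order
    weight = defaultdict(int)
    for users in users_at.values():
        for u, v in combinations(users, 2):
            weight[(u, v)] += 1
    return weight


weights = simulate_random_assignment()
for T in range(1, 9):
    edges = [pair for pair, w in weights.items() if w >= T]
    print("T =", T, "largest component:", largest_component_size(300, edges))
```

Lowering T can only add edges, so the largest component never shrinks as T decreases; where the sharp jump occurs depends on the assumed parameters.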

Figure 2 shows the simulation results. We can see that
there is a sharp increase in the size of the largest
connected component as the threshold T (the minimum edge
weight for two vertices to be connected) decreases (i.e., the
probability of two vertices being connected increases). In
other words, there exists some transition point of T. If T
is above this transition point, the graph contains only
isolated vertices and/or small components. Once T crosses
the transition point, the giant component “suddenly”
appears. Note that different spamming strategies may lead
to different transition values. Model 2 has a transition
value of T = 2, while Models 1 and 3 have the same
transition value of T = 3.

Using email server logs and a set of known botnet ac- 
counts provided by the Hotmail operational group, we 
have confirmed that generally bot-users are above the 
transition point of forming giant components, while nor- 
mal users usually cannot form large components with 
more than 100 nodes.

Figure 2: The size of the largest connected component.


4.1.2 Random Graph Theory 


The sudden appearance of a giant subgraph component 
after a transition point can be interpreted by the theory of 
random graphs. 

Denote G(n, p) as the random graph model, which
generates an n-vertex graph by simply assigning an edge
to each pair of vertices with probability p ∈ (0, 1]. We
call the generated graph an instance of the model G(n, p).
The parameter p determines when a giant connected
component will appear in the graph generated by G(n, p).
The following property is derived from theorems in [7,
p. 65-67]:


Theorem 1 A graph generated by G(n, p) has average
degree d = n · p. If d < 1, then with high probability
the largest component in the graph has size less than
O(log n). If d > 1, with high probability the graph will
contain a giant component with size on the order of O(n).


For a group of bot-users that share a set of IPs, the av- 
erage degree will be larger than one. According to the 
above theorem, the giant component will appear with a 
high probability. On the other hand, normal users rarely 
share IPs, and the average degree will be far less than 
one when the number of vertices is large. The resulting
graph of normal users will therefore contain isolated
vertices and/or small components, as we observe in our case.
In other words, the theorem explains the appearance of
giant components we have observed in subsection 4.1.1.
Based on the theorem, the sizes of the components can
serve as guidelines for bot-user pruning and grouping
(discussed in subsections 4.2.2 and 4.2.3).
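The phase transition stated in Theorem 1 can be observed empirically with a small experiment. The sketch below samples an instance of G(n, p) for a given average degree d = n · p and measures its largest component with a union-find structure; the vertex count and the degrees tried are arbitrary choices made for the illustration.

```python
import random


def largest_component_gnp(n, p, seed=0):
    """Sample G(n, p) and return the size of its largest connected component."""
    rng = random.Random(seed)
    parent = list(range(n))
    size = [1] * n

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for u in range(n):
        for v in range(u + 1, n):
            if rng.random() < p:  # include each edge independently with probability p
                ru, rv = find(u), find(v)
                if ru != rv:
                    if size[ru] < size[rv]:
                        ru, rv = rv, ru
                    parent[rv] = ru  # union by size
                    size[ru] += size[rv]
    return max(size[find(x)] for x in range(n))


n = 600
for d in (0.5, 1.5, 3.0):
    print("d =", d, "largest component:", largest_component_gnp(n, d / n))
```

For d below one the largest component should stay small relative to n, while for d above one a component covering a constant fraction of the vertices is expected.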


4.2 Bot-User Detection Algorithm 


As we have shown in section 4.1, a bot-user group forms a
connected component in the user-user graph. Intuitively
one could identify bot-user groups by simply extracting
the connected components from the user-user graph
generated with some predefined threshold T (the least
number of shared IPs for two vertices to be connected by an
edge). In reality, however, we need to handle the
following issues:


• It is hard to choose a single fixed threshold of T. As we
can see from Figure 2, different spamming strategies
may lead to different transition points.


• Bot-users from different bot-user groups may be in the
same connected component. This happens because: 1)
bot-users may be shared by different spammers, and 2)
a bot may be controlled by different spammers.

• There may exist connected components of normal
users. For example, mobile device users roaming
around different locations will be assigned IP
addresses from different ASes, and therefore appear as
a connected component.


To handle these problems, we propose a hierarchical 
algorithm to extract connected components, followed by 
a pruning and grouping procedure to remove false posi- 
tives and to separate mixed bot-user groups. 


4.2.1 Hierarchical Connected-Component 
Extraction 


Algorithm 1 describes a recursive function
Group_Extracting that extracts a set of connected
components from a user-user graph in a hierarchical
way. Having such a recursive process avoids using a
fixed threshold T, and is potentially robust to different
spamming strategies.

Using the original user-user graph as input,
BotGraph begins by applying Group_Extracting(G, T) to
the graph with T = 2. In other words, the algorithm first
identifies all the connected components with edge weight
w ≥ 2. It then recursively increases w to extract
connected subcomponents. This recursive process continues
until the number of nodes in the connected component
is smaller than a pre-set threshold M (M = 100 in our
experiments). The final output of the algorithm is a
hierarchical tree of the connected components with different
edge weights.


procedure Group_Extracting(G, T)

1  Remove all the edges with weight w < T from G
   and suppose we get G';

2  Find out all the connected subgraphs G_1, G_2, ...,
   G_k in G';

3  for i = 1 : k do
       Let |G_i| be the number of nodes in G_i;
       if |G_i| > M then
           Output G_i as a child node of G;
           Group_Extracting(G_i, T + 1);
       end
   end


Algorithm 1: A hierarchical algorithm for connected
component extraction from a user-user graph.
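Algorithm 1 can be transcribed into runnable form roughly as follows. This is a single-machine sketch, assuming the graph is given as a dict mapping user pairs to edge weights; the tree-node representation (a dict with keys "T", "users", and "children") is an illustrative choice rather than the paper's data structure.

```python
from collections import defaultdict


def connected_components(nodes, edges):
    """Connected components of `nodes` under the undirected `edges` (iterative DFS)."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    seen, comps = set(), []
    for start in nodes:
        if start in seen:
            continue
        seen.add(start)
        comp, stack = set(), [start]
        while stack:
            node = stack.pop()
            comp.add(node)
            for nxt in adj[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        comps.append(comp)
    return comps


def group_extracting(nodes, weights, T=2, M=100):
    """Return the child tree nodes of the graph (nodes, weights) at threshold T.

    weights: dict {(u, v): w}. M must be at least 1 so that recursion terminates.
    """
    if M < 1:
        raise ValueError("M must be at least 1")
    kept = {pair: w for pair, w in weights.items() if w >= T}  # step 1
    children = []
    for comp in connected_components(nodes, kept):             # step 2
        if len(comp) > M:                                      # step 3
            sub = {(u, v): w for (u, v), w in kept.items() if u in comp and v in comp}
            children.append({
                "T": T,
                "users": comp,
                "children": group_extracting(comp, sub, T + 1, M),
            })
    return children


weights = {(1, 2): 3, (2, 3): 3, (3, 4): 2, (4, 5): 3, (5, 6): 3}
tree = group_extracting(set(range(1, 7)), weights, T=2, M=2)
print(tree)
```

With the toy weights above and M = 2, the single component found at T = 2 splits into two subcomponents at T = 3, which become its children in the tree.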


4.2.2 Bot-User Pruning 


For each connected component output by Algorithm 1,
we want to compute the level of confidence that the set
of users in the component are indeed bot-users. In
particular, we need to remove from the tree (output by
Algorithm 1) the connected components involving mostly
legitimate/normal users.

Figure 3: Histograms of (1) number of emails sent per day and
(2) email size. First row: aggressive bot-users; second row:
normal users.

A major difference between normal users and bot-users 
is the way they send emails. More specifically, normal 
users usually send a small number of emails per day on 
average, with different email sizes. On the other hand, 
bot-users usually send many emails per day, with iden- 
tical or similar email sizes, as they often use a common 
template to generate spam emails. It may be difficult to 
use such differences in email-sending statistics to classify 
bot-accounts individually. But when a group of accounts 
are viewed in aggregate, we can use these statistics to es- 
timate how likely the entire group are bot-users. To do so, 
for each component, BotGraph computes two histograms 
from a 30-day email log: 


• h1: the numbers of emails sent per day per user.
• h2: the sizes of emails.


Figure 3 shows two examples of the above two
histograms, one computed from a component consisting of
bot-users (the first row), the other from a component of
normal users (the second row). The distributions are
clearly different. Bot-users in a component sent out a
larger number of emails on average, with similar email
sizes (around 3K bytes) that are visualized as the peak in
the email-size histogram. Most normal users sent a small
number of emails per day on average, with email sizes
distributed more uniformly. BotGraph normalizes each
histogram such that its sum equals one, and computes
two statistics, s1 and s2, from the normalized histograms
to quantify their differences:


• s1: the percentage of users who sent more than 3
emails per day;

• s2: the areas of peaks in the normalized email-size
histogram, or the percentage of users who sent out emails
with a similar size.

Since the histograms are normalized, both s1 and s2
are in the range of [0, 1] and can be used as confidence
measures. A large confidence value means that the
majority of the users in the connected component are bot-users.
We use only s1 to choose the candidates of bot-user
components, because s1 represents a more robust feature. We
use s2 together with other features (e.g., account naming
patterns) for validation purposes only (see Section 6).

In the pruning process, BotGraph traverses the tree
output by Algorithm 1. For each node in the tree, it computes
s1, the confidence measure for this node to be a bot-user
component, and removes the node if s1 is smaller than a
threshold S. In total, fewer than 10% of Hotmail accounts
sent more than 3 emails per day, so intuitively, we could set
the threshold S = 0.1. However, in order to minimize the
number of false positive users, we conservatively set the
threshold S = 0.8, i.e., we only consider nodes where at
least 80% of users sent more than 3 emails per day as
suspicious bot-user groups (discussed further in Section 6.2).
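The pruning step can be sketched as a recursive filter over the component tree. The code below assumes each tree node is a dict with "users" and "children" keys and that rates maps a user to the average number of emails sent per day; both representations are hypothetical. It also assumes that when a node is removed, any of its descendants that pass the threshold are kept and promoted, which the text does not specify.

```python
def s1_score(rates, cutoff=3):
    """Fraction of users in a component who sent more than `cutoff` emails per day."""
    if not rates:
        return 0.0
    return sum(1 for r in rates if r > cutoff) / len(rates)


def prune(tree, rates, S=0.8):
    """Keep only tree nodes whose s1 confidence is at least S.

    tree: list of nodes {"users": set, "children": list}.
    rates: dict user -> average emails per day (missing users count as 0).
    """
    kept = []
    for node in tree:
        children = prune(node["children"], rates, S)
        score = s1_score([rates.get(u, 0.0) for u in node["users"]])
        if score >= S:
            kept.append({"users": node["users"], "s1": score, "children": children})
        else:
            kept.extend(children)  # assumption: surviving descendants are promoted
    return kept


rates = {1: 10, 2: 12, 3: 8, 4: 1, 5: 0.5}
tree = [{"users": {1, 2, 3, 4, 5},
         "children": [{"users": {1, 2, 3}, "children": []}]}]
print(prune(tree, rates))
```

In this toy input the parent component has s1 = 0.6 and is removed, while its child has s1 = 1.0 and survives.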


4.2.3 Bot-User Grouping 


After pruning, a candidate connected-component may 
contain two or more bot-user groups. BotGraph proceeds 
to decompose such components further into individual 
bot-user groups. The correct grouping is important for 
two reasons: 


• We can extract validation features (e.g., s2 mentioned
above and patterns of account names) more accurately
from individual bot-user groups than from a mixture
of different bot-user groups.

• Administrators may want to investigate and take
different actions on different bot-user groups based on their
behavior.


We use the random graph model to guide the process of
selecting the correct bot-user groups. According to the
random graph model, the user-user subgraph of a bot-user
group should consist of a giant connected component
plus very small components and/or isolated vertices. So
BotGraph traverses the tree again to select tree nodes that
are consistent with this random graph property. For each
node V being traversed, there are two cases:


• V’s children contain one or more giant components
whose sizes are O(N), where N is the number of users
in node V;

• V’s children contain only isolated vertices and/or
small components with size of O(log(N)).


For case 1, we recursively traverse each subtree rooted at
the giant components. For case 2, we stop traversing the
subtree rooted at V. Figure 4 illustrates the process.
Here the root node R is decomposed into two giant
components A and B. B is further decomposed into another
two giant components D and E, while A is decomposed
into one giant component C. The giant component
disappears for any further decomposition, indicated by the
dashed lines. According to the theory, A, C, D, and E are
bot-user groups. If a node is chosen as a bot-user group,
the sub-tree rooted at the chosen node is considered as
belonging to the same bot-user group. That is, if we pick A,
we disregard its child C as it is a subcomponent of A.


5 Large-scale Parallel Graph Construction 


The major challenge in applying BotGraph is the
construction of a large user-user graph from the Hotmail
login data, the first stage of our graph-based analysis
described in Section 3.2. Each record in the input log
data contains three fields: UserID, IPAddress, and
LoginTimestamp. The output of the graph construction is a list
of edges in the form of UserID1, UserID2, and Weight.
The number of users on the graph is over 500 million
based on month-long login data (220 GB), and this
number is increasing as the Hotmail user population
grows. The number of edges of the computed graph is
on the order of hundreds of billions. Constructing such
a large graph using a single computer is impractical. An
efficient, scalable solution is required so that we can
detect attacks as early as possible in order to take timely
reactive measures.

For data scalability, fault tolerance, and ease of
programming, we choose to implement BotGraph using
Dryad/DryadLINQ, a powerful programming environment
for distributed data-parallel computing. However,
constructing a large user-user graph using
Dryad/DryadLINQ is non-trivial. This is because the
resulting graph is extremely large, so a straightforward
parallel implementation performs poorly.
In this section, we discuss our solutions in detail.
We first present both a simple parallelism method
and a selective filtering method, and then describe
several optimization strategies and their performance
impacts. We also discuss several important issues arising
in the system implementation, such as data partitioning,
data processing flow, and communication methods.
Using a one-month log as input, our current implementation
can construct a graph with tens of millions of nodes in 1.5
hours using a 240-machine cluster. During this process,
BotGraph filters out weight-one edges, and the remaining
number of edges for the next-stage processing is around
8.6 billion.

We also implemented the second stage of finding
connected components using Dryad/DryadLINQ. This stage
can be solved using a divide-and-conquer algorithm. In
particular, one can divide the graph edges into multiple
partitions, identify the connected subgraph components
in each partition, and then merge the incomplete
subgraphs iteratively. To avoid overloading the merging
node, instead of sending all outputs to a single merging
node, each time we merge two results from two
partitions. This parallel algorithm is both efficient and
scalable. Using the same 240-machine cluster in our
experiments, this parallel algorithm can analyze a graph with
8.6 billion edges in only 7 minutes, 34 times faster
than the 4-hour running time on a single computer. Given
that our performance bottleneck is at the first stage of graph
construction instead of graph analysis, we do not
elaborate further on this step.

Figure 5: Process flow of Method 1.
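The divide-and-conquer idea can be illustrated on a single machine as below. Each partition is reduced to a spanning forest (a subset of its edges that preserves connectivity), and forests are then merged two at a time. This is only a sequential sketch of the data flow; in a real deployment each call would run on a different node, and the function names and the round-robin edge partitioning are assumptions.

```python
def union_find(edges):
    """Return (spanning forest edges, {node: root}) for the given edge list."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    forest = []
    for u, v in edges:
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
            forest.append((u, v))  # this edge joins two previously separate pieces
    return forest, {x: find(x) for x in list(parent)}


def components_divide_and_conquer(edges, n_partitions=4):
    """Connected components computed per partition, then merged pairwise."""
    parts = [union_find(edges[i::n_partitions])[0] for i in range(n_partitions)]
    while len(parts) > 1:
        # Merge two partial results at a time rather than funnelling all into one.
        parts = [union_find(sum(parts[i:i + 2], []))[0]
                 for i in range(0, len(parts), 2)]
    _, root = union_find(parts[0])
    groups = {}
    for node, r in root.items():
        groups.setdefault(r, set()).add(node)
    return list(groups.values())


edges = [(1, 2), (2, 3), (4, 5), (6, 7), (7, 8), (8, 6), (3, 9)]
print(sorted(sorted(c) for c in components_divide_and_conquer(edges)))
```

Because a spanning forest has at most one edge per vertex, each intermediate result stays proportional to the number of vertices rather than the number of edges, which is what keeps the pairwise merges cheap.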


5.1 Two Implementation Methods 


The first step in data-parallel applications is to partition 
data. Based on the ways we partition the input data, 
we have different data processing flows in implementing 
graph construction. 


5.1.1 Method 1: Simple Data Parallelism 


Our first approach is to partition data according to IP ad- 
dress, and then to leverage the well known Map and Re- 
duce operations to straightforwardly convert graph con- 
struction into a data-parallel application. 

As illustrated in Figure 5, the input dataset is
partitioned by the user-login IP address (Step 1). During the
Map phase (Steps 2 and 3), for any two users U_i and U_j
sharing the same IP-day pair, where the IP address is
from Autonomous System AS_k, we output an edge with
weight one e = (U_i, U_j, AS_k). Only edges pertaining to
different ASes need to be returned (Step 3). To avoid
outputting the same edge multiple times, we use a local hash
table to filter duplicate edges.

After the Map phase, all the generated edges (from all
partitions) will serve as inputs to the Reduce phase. In
particular, all edges will be hash partitioned to a set of
processing nodes for weight aggregation using (U_i, U_j)
tuples as hash keys (Step 4). Obviously, for those user
pairs that only share one IP-day in the entire dataset, there
is only one edge between them. So no aggregation can
be performed for these weight-one edges. We will show
later in Figure 7 that weight-one edges are the dominant
source of graph edges. Since BotGraph focuses only on
edges with weight two and above, the weight-one edges
introduce unnecessary communication and computation
cost to the system. After aggregation, the outputs of the
Reduce phase are graph edges with aggregated weights.

Figure 6: Process flow of Method 2.
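The Map and Reduce phases of Method 1 can be mimicked in plain Python as follows. This is a sequential sketch rather than DryadLINQ code: records are assumed to be (user, ip, day) tuples, ip_to_as is a hypothetical lookup table, and the weight of an edge is taken to be the number of distinct ASes in which the two users shared an IP-day pair.

```python
from collections import Counter, defaultdict
from itertools import combinations


def map_phase(partition, ip_to_as):
    """Emit weight-one edges (u, v, AS) for users sharing an IP-day pair.

    A local set plays the role of the hash table that filters duplicate edges.
    """
    users_by_ip_day = defaultdict(set)
    for user, ip, day in partition:
        users_by_ip_day[(ip, day)].add(user)
    emitted = set()
    for (ip, _day), users in users_by_ip_day.items():
        for u, v in combinations(sorted(users), 2):
            emitted.add((u, v, ip_to_as[ip]))
    return emitted


def reduce_phase(all_edges):
    """Aggregate edge weights per (u, v), counting each AS once."""
    distinct = set(all_edges)  # the same (u, v, AS) may come from several partitions
    return Counter((u, v) for u, v, _as in distinct)


def method1(records, ip_to_as, n_partitions=4):
    parts = defaultdict(list)
    for rec in records:
        parts[hash(rec[1]) % n_partitions].append(rec)  # partition by IP address
    edges = []
    for part in parts.values():
        edges.extend(map_phase(part, ip_to_as))
    return reduce_phase(edges)


records = [("a", "ip1", 1), ("b", "ip1", 1), ("c", "ip1", 1),
           ("a", "ip2", 2), ("b", "ip2", 2),
           ("a", "ip3", 3), ("b", "ip3", 4),
           ("a", "ip4", 5), ("b", "ip4", 5)]
ip_to_as = {"ip1": 100, "ip2": 100, "ip3": 200, "ip4": 200}
print(dict(method1(records, ip_to_as)))
```

Note that every weight-one edge (such as the two edges involving user c) travels from the Map phase to the Reduce phase even though it is discarded later, which is the overhead the text describes.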


5.1.2 Method 2: Selective Filtering 


An alternative approach is to partition the inputs based on
user ID. In this way, for any two users that are located in
the same partition, we can directly compare their lists of
IP-day pairs to compute their edge weight. For two users
whose records are located in different partitions, we need to
ship one user’s records to the other user’s partition before
computing their edge weight, resulting in huge
communication costs.

We notice that for users who do not share any IP-day 
keys, such communication costs can be avoided. That 
is, we can reduce the communication overhead by se- 
lectively filtering data and distributing only the related 
records across partitions. 

Figure 6 shows the processing flow of generating
user-user graph edges with such an optimization. For each
partition p_i, the system computes a local summary s_i to
represent the union of all the IP-day keys involved in this
partition (Step 2). Each local summary s_i is then
distributed across all nodes for selecting the relevant input
records (Step 3). At each partition p_j (j ≠ i), upon
receiving s_i, p_j will return all the login records of users
who shared the same IP-day keys in s_i. This step can be
further optimized based on the edge threshold w: if a user
in p_j shares fewer than w IP-day keys with the summary
s_i, this user will not generate edges with weight at least
w. Thus only the login records of users who share at least
w IP-day keys with s_i should be selected and sent to
partition p_i (Step 4). To ensure the selected user records will
be shipped to the right original partition, we add an
additional label to each original record to denote its
partition ID (Step 7). Finally, after partition p_i receives the
records from partition p_j, it joins these remote records
with its local records to generate graph edges (Steps 8 and
9).
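The selective filtering flow can be sketched sequentially as below. Partitions are modeled as dicts from user to the set of (ip, day) keys, which is an assumed representation, and for simplicity the edge weight here is the number of shared IP-day keys without the AS-level collapsing. The step numbers in the comments refer to the description above.

```python
def method2(partitions, w=2):
    """Return {(u, v): weight} for user pairs sharing at least w IP-day keys.

    partitions: list of dicts {user: set of (ip, day) keys}, split by user ID.
    """
    # Step 2: each partition summarizes the IP-day keys it contains.
    summaries = [set().union(*part.values()) for part in partitions]
    edges = {}
    for i, local in enumerate(partitions):
        # Steps 3-4: every other partition selects users sharing >= w keys with s_i
        # and ships only those users' records to partition i.
        remote = {}
        for j, other in enumerate(partitions):
            if j == i:
                continue
            for user, keys in other.items():
                if len(keys & summaries[i]) >= w:
                    remote[user] = keys
        # Steps 8-9: join local records with local and shipped remote records.
        candidates = {**local, **remote}
        for u, ukeys in local.items():
            for v, vkeys in candidates.items():
                if u < v:  # each pair is produced once, at the smaller user's partition
                    weight = len(ukeys & vkeys)
                    if weight >= w:
                        edges[(u, v)] = weight
    return edges


partitions = [
    {"a": {("ip1", 1), ("ip2", 2)}, "c": {("ip9", 1)}},
    {"b": {("ip1", 1), ("ip2", 2), ("ip3", 3)}, "d": {("ip1", 1)}},
]
print(method2(partitions, w=2))
```

A user who shares at least w keys with some user in partition i necessarily shares at least w keys with the summary s_i, so the filter never drops a record needed for an edge of weight w or more; user d above shares only one key and is never shipped.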

Other than Map and Reduce, this method requires two 
additional programming interface supports: the operation 
to join two heterogeneous data streams and the operation 
to broadcast a data stream.

Figure 7: Edge weight distribution.


5.1.3 Comparison of the Two Methods


In general, Method 1 is simple and easy to implement, 
but Method 2 is more optimized for our application. The 
main difference between the two data processing flows is 
that Method 1 generates edges of weight one and sends 
them across the network in the Reduce phase, while 
Method 2 directly computes edges with weight w or 
more, with the overhead of building a local summary and 
transferring the selected records across partitions.
Figure 7 shows the distribution of edge weights using one
month of user login records as input. Here, the number
of weight-one edges is almost three orders of magnitude
more than the weight-two edges. In our botnet detection,
we are interested in edges with a minimum weight of two
because weight-one edges do not show strongly correlated
login activities between two users. Therefore the
computation and communication spent on generating
weight-one edges are not necessary. Although in Method 1, Step
3 can perform local aggregation to reduce the number 
of duplicated weight one edges, local aggregation does 
not help much as the number of unique weight one edges 
dominates in this case. 

Given our implementation is based on the existing 
distributed computing models such as MapReduce and 
DryadLINQ, the amount of intermediate results impacts 
the performance significantly because these program- 
ming models all adopt disk read/write as cross-node com- 
munication channels. Using disk access as communica- 
tion is robust to failures and easy to restart jobs [6, 29]. 
However, when the communication cost is large such as 
in our case, it becomes a major bottleneck of the over- 
all system running time. To reduce this cost, we used a 
few optimization strategies and will discuss them in the 
next subsection. Completely re-designing or customizing 
the underlying communication channels may improve the 
performance in our application, but is beyond the scope 
of this paper. 

Note the amount of cross-node communication also
depends on the cluster size. Method 1 results in a constant
communication overhead, i.e., the whole edge set,
regardless of the number of data partitions. But for Method
2, when the number of computers (hence the number of
data partitions) increases, both the aggregated local
summary size and the number of user-records to be shipped
increase, resulting in a larger communication overhead.
In the next subsections, we present our implementations
and evaluate the two different methods using real-data
experiments.

Figure 8: (a) Default query execution plan (serial merge).
(b) Optimized query execution plan (parallel merge).


5.2 Implementations and Optimizations 


In our implementation, we have access to a 240-machine 
cluster. Each machine is configured with an AMD Dual 
Core 4.3G CPU and 16 GB memory. As a pre-processing 
step, all the input login records were hash partitioned 
evenly to the computer cluster using the DryadLINQ 
built-in hash-partition function. 


Given that the Hotmail login data is on the order of
hundreds of gigabytes, we invested considerable engineering
effort to reduce the input data size and cross-node
communication costs. The first two data reduction strategies
can be applied to both methods. The last optimization is
customized for Method 2 only.


1. User pre-filtering: We pre-filter users by their lo- 
gin AS numbers: if a user has logged in from IP addresses 
across multiple ASes in a month, we regard this user as 
a suspicious user candidate. By choosing only suspicious 
users (using 2 ASes as the current threshold) and their 
records as input, we can reduce the number of users to
consider from over 500 million (about 200-240 GB) to
about 70 million (about 100 GB). This step completes in
about 1-2 minutes.


2. Compression: Given the potential large communi- 
cation costs, BotGraph adopts the DryadLINQ provided 
compression option to reduce the intermediate result size. 
The use of compression can reduce the amount of cross- 
node communication by 2-2.5 times. 


3. Parallel data merge: In Method 2, Step 3 merges 
the local IP-day summaries generated from every node
and then broadcasts the aggregated summary to the entire 
cluster. The old query plan generated by DryadLINQ is 
shown in Figure 8 (a), where there exists a single node 
that performs data aggregation and distribution. In our 
experiments, this aggregating node becomes a big bot- 
tleneck, especially for a large cluster. So we modified 
DryadLINQ to generate a new query plan that supports 
parallel data aggregation and distribution from every 
processing node (Figure 8 (b)). We will show in Sec- 
tion 5.3 that this optimization can reduce the broadcast 
time by 4-5 times. 
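Of the three optimizations above, user pre-filtering is the simplest to illustrate. The sketch below keeps only the records of users who logged in from at least two distinct ASes; the record layout (user, ip, timestamp) and the ip_to_as lookup table are assumptions made for the example.

```python
from collections import defaultdict


def prefilter(records, ip_to_as, min_ases=2):
    """Keep records of users seen in at least `min_ases` distinct ASes.

    records: list of (user, ip, timestamp) tuples.
    ip_to_as: dict mapping an IP address to its AS number (hypothetical lookup).
    """
    ases = defaultdict(set)
    for user, ip, _ts in records:
        ases[user].add(ip_to_as[ip])
    suspicious = {user for user, seen in ases.items() if len(seen) >= min_ases}
    return [rec for rec in records if rec[0] in suspicious]


records = [("a", "ip1", 1), ("a", "ip3", 2), ("b", "ip1", 1), ("b", "ip2", 3)]
ip_to_as = {"ip1": 100, "ip2": 100, "ip3": 200}
print(prefilter(records, ip_to_as))
```

User b logs in from two IP addresses in the same AS and is therefore dropped, consistent with the AS-level definition of shared IP addresses used throughout.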


            Communication data size    Total running time
Method 1    12.0 TB                    > 6 hours
Method 2    1.7 TB                     about 1.5 hours


Table 1: Performance comparison of the two methods using the
2008-dataset.





                         Communication data size    Total running time
Method 1 (no comp.)      2.71 TB                    135 min
Method 1 (with comp.)    1.02 TB
Method 2 (no comp.)      460 GB
Method 2 (with comp.)    181 GB


Table 2: Performance comparison of the two methods using a
subset of the 2008-dataset.


5.3 Performance Evaluation


In this section, we evaluate the performance of our im- 
plementations using a one-month Hotmail user-login log 
collected in Jan 2008 (referred to as the 2008-dataset). 
The raw input data size is 221.5 GB, and after pre- 
filtering, the amount of input data is reduced to 102.9 
GB. To use all the 240 machines in the cluster, we gen- 
erated 960 partitions to serve as inputs to Method 1 (so 
that the computation of each partition fits into memory), 
and generated 240 partitions as inputs to Method 2. With
compression and parallel data merge both enabled, our
implementation of Method 2 finishes in about 1.5 hours
using all the 240 machines, while Method 1 cannot finish
within the maximum 6-hour quota allowed by the
computer cluster (Table 1). The majority of time in Method
1 is spent on the second Reduce step to aggregate a huge
volume of intermediate results. For Method 2, the local
summary selection step generated about 5.8 GB of
aggregated IP-day pairs to broadcast across the cluster,
resulting in 1.35 TB out of the 1.7 TB total traffic.

In order to benchmark performance, we take a smaller
dataset (about 1/5 of the full 2008-dataset) that Method
1 can finish within 6 hours. Table 2 shows the
communication costs and the total running time using the
240-machine cluster. While Method 1 potentially has better
scalability than Method 2 as discussed in Section 5.1.3,
given our practical constraints on the cluster size, Method
2 generates a smaller amount of traffic and runs about
5-6 times faster than Method 1. The use of compression
reduces the amount of traffic by about 2-3 times, and
the total running time is about 14-25% faster.

To evaluate the system scalability of Method 2, we
vary the number of data partitions to use different
numbers of computers. Figure 9 shows how the
communication overheads grow. With more partitions, the amount
of data generated from each processing node slightly
decreases, but the aggregated local summary data size
increases (Figure 9 (a)). This is because popular IP-day
pairs may appear in multiple data partitions and hence
in the aggregated summary multiple times. Similarly,
the same user login records will also be shipped across
a larger number of nodes, increasing the communication
costs as the system scales (Figure 9 (b)).

Figure 9: Communication data size as we vary the number of
input data partitions. (a) Local summary size in terms of the
number of IP-day keys. (b) Total number of selected user
login records to be sent across the network.

Figure 10: Running time as we vary the number of input data
partitions for Method 2. (a) Total running time of all partitions.
(b) The running time of each partition. The error bars show the
max and the min running time across all partitions.

Even though the communication costs increase, the to- 
tal running time is still reduced with a larger cluster size. 
Figure 10 (a) shows the total running time and its break- 
down across different steps. When the cluster size is 
small (10 partitions), a dominant amount of time is spent 
on computing the graph edges. As the system scales, this 
portion of time decreases sharply. The other three steps 
are I/O and network intensive. Their running time slightly 
decreases as we increase the number of partitions, but the 
savings get diminished due to the larger communication 
costs. Figure 10 (b) shows the average running time spent 
on processing each partition, and its variations are very 
small. 

We now examine the benefits of adopting parallel data 
merge. The purpose of parallel data merge is to remove 
the bottleneck node that performs data aggregation and 
broadcasting. Since it is difficult to factor out the network 
transfer time savings alone (network, disk I/O, and com- 
putation are pipelined), we compare the time spent on the 
user record selection step (Figure 11 (a)). This optimiza- 
tion can reduce the processing latency significantly as the 
cluster size increases (75% reduction in the 200 node sce- 
nario). Without parallel data merge, the processing time 
increases almost linearly, but with this optimization, the 
amount of time remains roughly constant. 

For Method 2, one reason for the large communica- 
tion costs is that for botnet users, their graph component 


NSDI ’09: 6th USENIX Symposium on Networked Systems Design and Implementation 





[Figure 11 plots. (a) User record selection time (minutes) versus number of partitions, with and without parallel merge. (b) Minimum and maximum time under random partition and strategic partition.]
Figure 11: (a) The processing time of user-record selection with
and without parallel data merge. (b) Minimum and maximum
running time of partitions with and without strategic data
partitioning.


is both large and dense. Therefore, one potential opti- 
mization technique is to strategically partition the login 
records. Intuitively, we can reduce the communication 
costs if we pre-group users so that users who are heav- 
ily connected are placed in one partition, and users who 
are placed in different partitions have very few edges be- 
tween them. If so, Step 4 in Method 2 will return only 
a small number of records to ship across different nodes. 
Surprisingly, we found that this strategy actually had a
negative impact on system performance.

Figure 11 (b) shows the graph construction time spent
at a processing node with and without strategic data
partitioning. We chose the 240 input data partition scenario
and used the full dataset to illustrate the performance
difference. In the first case, we evenly distributed login
records by hashing user IDs. In the second case, we
chose a large botnet user group with 3.6M users and spread
all their login records evenly across 5 partitions, with the
remaining data evenly distributed across the remaining
partitions. This scenario assumes the best prior knowledge
of user connections. Although in both cases the total
amount of input data in each partition is roughly uniform,
we observe a big difference between the maximum and
minimum time spent computing the edges across nodes.
Without strategic partitioning, the maximum and minimum
processing times are very close. In contrast, strategic
partitioning caused a severe workload imbalance,
resulting in a much longer total job running time.
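The first case above, distributing login records by hashing user IDs, can be sketched as follows. The choice of MD5 as the hash and the function names are assumptions made for illustration; the text does not say which hash function was used.

```python
import hashlib
from collections import Counter


def partition_of(user_id: str, num_partitions: int) -> int:
    """Map a user ID to a partition using a stable hash (MD5 is an assumption)."""
    digest = hashlib.md5(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


def partition_sizes(user_ids, num_partitions):
    """Count how many user IDs land in each partition."""
    counts = Counter(partition_of(u, num_partitions) for u in user_ids)
    return [counts.get(p, 0) for p in range(num_partitions)]
```

Hashing balances the number of records per partition. As the measurements above indicate, balanced input size alone does not guarantee balanced edge-computation time, since that time also depends on how densely the users in a partition are connected.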


6 Bot-user Detection and Validation 


We use two month-long datasets as inputs to our system: 
a 2007-dataset collected in Jun 2007, and a 2008-dataset 
collected in Jan 2008. Each dataset includes two logs: a 
Hotmail login log (format described in Section 5) and a 
Hotmail signup log. Each record in the signup log con- 
tains a user-ID, the remote IP address used for signup, 
and the signup timestamp. For each dataset, we run our 
EWMA-based anomaly detection on the signup log and 
run our graph-based detection on the login log. Using
both components, BotGraph detected tens of millions of 
bot users and millions of botnet IPs. Table 3 summarizes 
the results for both months. We present the detailed re- 
sults and perform evaluations next. 
Figure 12: (a) Cumulative distribution of anomaly window size
in terms of number of days. (b) Cumulative distribution of the
number of accounts signed up per suspicious IP.











































6.1 Detection Using Signup History 


Table 4 shows that the EWMA algorithm detected 21.2 
million bot-user accounts when applied to the two Hot- 
mail signup logs. Comparing Jan 2008 with Jun 2007, 
both the number of bot IPs and the signed-up bot-users 
increased significantly. In particular, the total number of 
bot-accounts signed up in Jan 2008 is more than three 
times the number in Jun 2007. Meanwhile, the anomaly 
window is shortened from an average of 1.45 days to 1.01 
days, suggesting each attack became shorter in Jan 2008. 

Figure 12 (a) shows the cumulative distribution of the 
anomaly window sizes associated with each bot IP ad- 
dress. A majority (80% - 85%) of the detected IP ad- 
dresses have small anomaly windows, ranging from a few 
hours to one day, suggesting that many botnet signup at- 
tacks happened in a burst. 

Figure 12 (b) shows the cumulative distributions of the 
number of accounts signed up per bot IP. As we can see, 
the majority of bot IPs signed up a large number of ac- 
counts, even though most of them have short anomaly 
windows. Interestingly, the cumulative distributions de- 
rived from Jun 2007 and Jan 2008 overlap well with each 
other, although we observed a much larger number of 
bot IPs and bot-users in Jan 2008. This indicates that
the overall bot-user signup activity patterns remain
similar, perhaps due to the reuse of bot-account signup
tools/software.


6.2 Detection by User-User Graph 


Figure 13: Bot-user group properties: (a) The number of
users per group. (b) The peakness score of each group,
reflecting whether there exists a strong sharp peak in the
email size distribution.

We apply the graph-based bot-user detection algorithm
to the Hotmail login log to derive a tree of connected
components. Each connected component is a set of bot-user
candidates. We then use the procedures described in
Section 4.2.2 to prune the connected components of normal
users. Recall that in the pruning process, we apply
a threshold on the confidence measure of each component
(computed from the “email-per-day” feature) to remove
normal user components. In our experiments, the
confidence measures are well separated: most of the
bot-groups have confidence measures close to 1, and a few
groups are between 0.4 and 0.6. We observe a wide margin
around a confidence measure of 0.8, which we choose
as our threshold. As discussed in Section 4.2.2, this is
a conservative threshold and is insensitive to noise due
to the wide margin. Any group that has a confidence
measure below 0.8 is regarded as a normal user group
and pruned from our tree.
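The thresholding step can be sketched in a few lines. The dictionary layout and the function name are hypothetical; only the 0.8 threshold comes from the text, and how the confidence measure itself is computed is defined in Section 4.2.2, not here.

```python
CONFIDENCE_THRESHOLD = 0.8  # threshold value reported in the text


def prune_normal_groups(confidence_by_group, threshold=CONFIDENCE_THRESHOLD):
    """Keep only groups whose confidence measure is not below the threshold.

    confidence_by_group: dict mapping a group ID to its confidence measure
    (assumed to be precomputed elsewhere).
    """
    return {
        group: confidence
        for group, confidence in confidence_by_group.items()
        if confidence >= threshold
    }
```

Because the observed measures cluster near 1 or between 0.4 and 0.6, moving the threshold slightly in either direction would leave the set of retained groups unchanged.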

Table 5 shows the final detection results after pruning 
and grouping. Both the number of bot-users and the num- 
ber of bot IP addresses are on the order of millions — a 
non-trivial fraction of all the users and IP addresses ob- 
served by Hotmail. We find the two sets of bot-users 
detected in two months hardly overlap. These accounts 
were stealthy ones, each sending out only a few to tens 
of spam emails during the entire month. Therefore, it is 
difficult to capture them by looking for aggressive sending
patterns. Due to their large population, detecting and
sanitizing these users is important both to save Hotmail
resources and to reduce the amount of spam sent to the
Internet. Comparing Jan 2008 with Jun 2007, the number
of bot-users tripled, suggesting that using Web portals as
a spamming medium has become more popular.

Now we study the properties of bot-users at a group 
level. Figure 13 (a) shows that the number of users in 
each group ranges from thousands to millions. Compar- 
ing Jan 2008 with Jun 2007, although the largest bot- 
user group remains similar in size, the number of groups 
increased significantly. This confirms our previous ob- 
servation that spammers are more frequently using Web 
email accounts for spam email attacks. 

We next investigate the email sending patterns of the
detected bot-user groups. We are interested in whether
there exists a strong peak of email sizes. We use the
peakness score metric s2 (defined in Section 4.2.2) to quantify
the degree of email size similarity for each group.
Figure 13 (b) shows the distribution of s2 in sorted order.
A majority of groups have peakness scores higher than
0.6, meaning that over 60% of their emails have similar
sizes. For the remaining groups, we performed manual
investigation and found they have multiple peaks, resulting
in lower scores. The similarity of their email sizes is
strong evidence of correlated email sending activities.


In the next two subsections, we explore the quality of
the 26 million bot-users captured in total. First, we examine
how many of them are already known to be bad and how many
are our new findings. Second, we estimate our detection
false positive rates.


6.3 Known Bot-users vs. New Findings 


We evaluate our detected bot-users against a set of known
spammer users reported by other email servers in Jan
2008 (see footnote 4).


Denote $H$ as the set of bot-users detected by signup
history using EWMA, $K_s$ as the set of known spammer
accounts signed up in the month that we study, and
$K_s \cap H$ as the intersection between $H$ and $K_s$. The
ratio $\frac{|K_s \cap H|}{|H|}$ represents the percentage of
captured bot-users that were previously known bad. In other
words, $1 - \frac{|K_s \cap H|}{|H|}$ is the fraction that are
our new findings. The ratio $\frac{|K_s \cap H|}{|K_s|}$
denotes the recall of our approach. Table 6 shows that, in
Jun 2007, 85.15% of the EWMA-detected bot-users were
already known bad, and the detected bot-users cover a
significant fraction of the known bad accounts, i.e.,
recall = 67.96%. Interestingly, Jan 2008 yields quite
different results. EWMA is still able to detect a large
fraction of the known bad accounts. However, only 8.17% of
the detected bot-users were reported to be bad. That means
91.83% of the captured spamming accounts are our new findings.
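The two ratios reduce to simple set arithmetic. The sketch below assumes that detected and known accounts are available as Python sets of account identifiers; the function and variable names are illustrative only.

```python
def known_fraction_and_recall(detected, known):
    """Compare a detected set against a known-bad set.

    Returns (known_fraction, recall), where
      known_fraction = |known & detected| / |detected|
      recall         = |known & detected| / |known|
    and 1 - known_fraction is the share of new findings.
    """
    overlap = detected & known
    known_fraction = len(overlap) / len(detected) if detected else 0.0
    recall = len(overlap) / len(known) if known else 0.0
    return known_fraction, recall
```

For example, with four detected accounts of which two are among three known-bad accounts, half of the detections are already known and the recall is two thirds.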





We apply a similar study to the bot-users detected by
the user-user graph. Denote $K_l$ as the set of known
spammer users that log in from at least 2 ASes, $L$ as the
set of bot-users detected using our user-user graph based
approach, and $K_l \cap L$ as the intersection between $K_l$
and $L$. Again we use the ratios $\frac{|K_l \cap L|}{|L|}$ and
$\frac{|K_l \cap L|}{|K_l|}$ to evaluate our result $L$, as
shown in Table 7. Using our graph-based approach, the
recall is higher. In total, we were able to detect 76.84%
and 85.80% of known spammer users in Jun 2007 and Jan 2008,
respectively. Similar to EWMA, the graph-based detection
also identified a large fraction (54.10%) of previously
unknown bot-accounts in Jan 2008. This might be because
these accounts are new ones and have not yet been used
aggressively to send out a massive amount of spam emails,
so they had not yet been reported by other mail servers
as of Jan 2008. The ability to detect bot-accounts at an
early stage is important for gaining an upper hand in
the anti-spam battle.


4 These users were reported for sending outbound spam
emails.
















[Figure 14 plot: naming pattern score versus bot-user group size.]


Figure 14: Validation of login-graph detected bot-users using 
naming scores. 


6.4 False Positive Analysis 


In the previous subsection, we analyzed the overlap be- 
tween our results and the set of known bad accounts. For 
the remaining ones, validation is a challenging task with- 
out the ground truth. We examine the following two ac- 
count features to estimate the false positive rates: naming 
patterns and signup dates. 


6.4.1 Naming Patterns 


For the identified groups, we found that almost every group
follows a very clear user-name template, for example, a
fixed-length sequence of letters mixed with digits (see
footnote 5). Examples of such names are “w9168d4dc8c5c25f9”
and “x9550a21da4e456a2”.

To quantify the similarity of account names in a group,
we introduce a naming pattern score, which is defined as
the largest fraction of users that follow a single template.
Each template is a regular expression derived by a regular 
expression generation tool [27]. Since many accounts de- 
tected in Jun 2007 were known bad and hence cleaned by 
the system already, we focus on bot-user groups detected 
in Jan 2008. 

Figure 14 shows the naming score distribution. A majority
of the bot-user groups have naming pattern scores close
to 1, indicating that they were signed up by spammers
using some fixed templates. There are only a few
bot-user groups with scores lower than 0.95. We manu- 
ally looked at them and found that they are also bad users, 
but the user names come from two naming templates. 
It is possible that our graph-based approach mixed two 
groups, or the spammers purchased two groups of bot-users
and used them together. Overall, we found that only
0.44% of the identified bot-users do not strictly follow
the naming templates of their corresponding groups.
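The naming pattern score defined above can be sketched as follows. In the study the templates come from a regular expression generation tool [27]; here they are hand-written regular expressions supplied by the caller, which is an assumption made purely for illustration, as is the function name.

```python
import re


def naming_pattern_score(user_names, templates):
    """Largest fraction of user names that fully match a single template.

    templates: iterable of regular-expression strings (assumed to be
    given; this sketch does not generate them).
    """
    if not user_names:
        return 0.0
    best_matches = 0
    for template in templates:
        pattern = re.compile(template)
        matches = sum(1 for name in user_names if pattern.fullmatch(name))
        best_matches = max(best_matches, matches)
    return best_matches / len(user_names)
```

A group whose names all follow one template scores 1, while a group that mixes two templates scores the share of its larger sub-population.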


6.4.2 Signup Dates 


Our second false positive estimate is based on examining
the signup dates of the detected bot-users. Since the
Web-account abuse attack is recent and started in summer
2007, we regard all the accounts signed up before
2007 as legitimate accounts. Only 0.08% of the identified
bot-users were signed up before 2007. To calibrate
our results against the entire user population, we
look at the signup dates of all users in the input dataset.
About 59.1% of the population signed up before
2007. Assuming the normal user signup-date distributions
are the same among the overall population and our
detected user set, we adjust the false positive rate to be
0.08% / 59.1% = 0.13%.

5 Note that it is hard to directly use the naming pattern
itself to identify spamming accounts, because it is easy
for spammers to apply countermeasures.

The above two estimates suggest that the false positive
rate of BotGraph is low. We conservatively pick the
higher one, 0.44%, as our false positive rate estimate.


7 Discussion 


In this paper, we demonstrated that BotGraph can detect
tens of millions of bot-users and millions of bots. With
this information, operators can take remedial actions and
mitigate the ongoing attacks. For bot-users, operators can
block their accounts to prevent them from sending further
spam, or apply stricter policies when they log in (e.g.,
request them to do additional CAPTCHA tests). For detected
bot IP addresses, one approach is to blacklist them
or rate limit their login activities, depending on whether
the corresponding IP address is a dynamically assigned
address or not. Effectively throttling botnet attacks in the
presence of dynamic IP addresses is ongoing work.

Attackers may wish to evade the BotGraph detection 
by developing countermeasures. For example, they may 
reduce the number of users signed up by each bot. They 
may also mimic the normal user email-sending behav- 
ior by reducing the number of emails sent per account 
per day (e.g., fewer than 3). Although mimicking normal 
user behavior may evade history-based change detection 
or our current thresholds, these approaches also significantly
limit the attack scale by reducing the number of
bot-accounts they can obtain or the total number of spam
emails they can send. Furthermore, BotGraph can still capture
the graph structures of bot-user groups from their login 
activity to detect them. 

A more sophisticated evasion approach may bind each
bot-user to bots in only one AS, so that our current
implementation would pre-filter them by the two-AS
threshold. To mitigate this attack, BotGraph may revise the
edge weight definition to look at the number of IP
prefixes instead of the number of ASes. This potentially
pushes the attacker countermeasures to be more like a
fixed IP-account binding strategy. As discussed in
Section 3.2, binding each bot-user to a fixed bot is not
desirable to the spammers. Due to the high botnet churn
rate, it would result in a low bot-user utilization rate. It
also makes attack detection easier by having a fixed group
of aggressive accounts on the same IP addresses all the
time. If one of the bot-accounts is captured, the entire
group can be easily revealed. A more generalized solution
is to broaden our edge weight definition by considering
additional feature correlations. For example, we can
potentially use email sending patterns such as the
destination domain [24], email size, or email content patterns
(e.g., URL signatures [27]). As ongoing work, we are
exploring a larger set of features for more robust attack
detection.

(Table note: see text for the definition of $K_l$ and $L$.)

In addition to using graphs, we may also consider other 
alternatives to capture the correlated user activity. For 
example, we may cluster user accounts using their login 
IP addresses as feature dimensions. Given the large data 
volume, how to accurately and efficiently cluster user ac- 
counts into individual bot-groups remains a challenging 
research problem. 

It is worth mentioning that the design and imple- 
mentation of BotGraph can be applied in different ar- 
eas for constructing and analyzing graphs. For ex- 
ample, in social network studies, one may want to 
group users based on their buddy relationship (e.g., from 
MSN or Yahoo messengers) and identify community pat- 
terns. Finally, although our current implementations are 
Dryad/DryadLINQ specific, we believe the data process- 
ing flows we propose can be potentially generalized to 
other programming models. 


8 Conclusion


We designed and implemented BotGraph for Web mail
service providers to defend against botnet-launched
Web-account abuse attacks. BotGraph consists of two
components: a history-based change-detection component to
identify aggressive account signup activities and a
graph-based component to detect stealthy bot-user login
activities. Using two months of Hotmail logs, BotGraph
successfully detected more than 26 million botnet accounts.
To process a large volume of Hotmail data, BotGraph is 
implemented as a parallel Dryad/DryadLINQ application 
running on a large-scale computer cluster. In this paper, 
we described our implementations in detail and presented 
performance optimization strategies. As general-purpose 
distributed computing frameworks have become increas- 
ingly popular for processing large datasets, we believe 
our experience will be useful to a wide category of appli- 
cations for constructing and analyzing large graphs. 
A EWMA-based Aggressive Signup Detection


Exponentially Weighted Moving Average (EWMA) is a
well-known moving-average-based algorithm for detecting
sudden changes. EWMA is both simple and effective,
and has been widely used for anomaly detection [12].
Given a time series, let the observation value at
time $t$ be $Y_t$. Let $S_t$ be the predicted value at time $t$
and $\alpha$ ($0 < \alpha < 1$) be the weighting factor. EWMA
predicts $S_t$ as

$$S_t = \alpha \times Y_{t-1} + (1 - \alpha) \times S_{t-1} \quad (1)$$

We define the absolute prediction error $E_t$ and the
relative prediction error $R_t$ as:

$$E_t = Y_t - S_t, \qquad R_t = Y_t / \max(S_t, \epsilon) \quad (2)$$

where $\epsilon$ is introduced to avoid the divide-by-zero
problem. A large prediction error $E_t$ or $R_t$ indicates a sudden
change in the time series data and should raise an alarm.
When the number of new users signed up has dropped to
the number before the sudden change, the sudden change
ends. We define the time window between the start and
the end of a sudden change as the anomaly window. All
the accounts signed up during this anomaly window are
suspicious bot-users.

In our implementation, we consider the time unit of
a day, and hence $S_t$ is the predicted number of daily
signup accounts. For any IP address, if both $E_t > \delta_E$
and $R_t > \delta_R$, we mark day $t$ as the start of its anomaly
window. From a two-year Hotmail signup log, we derive
the 99th percentile of the daily number of account signups per
IP address. To be conservative, we set the threshold $\delta_E$
to be twice this number to rule out non-proxy normal IPs.
For proxies, the relative prediction error is usually a better
metric to separate them from bots. It is very rare for a
proxy to increase its signup volume by 4 times overnight.
So we conservatively set $\delta_R$ to 4.
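The detection rule for a single IP address can be sketched as below. Several details are assumptions made for illustration rather than parts of the described implementation: the initial prediction is set to the first observation, the prediction is frozen while an anomaly window is open, and windows are reported as half-open index ranges. The function and parameter names are hypothetical.

```python
def ewma_anomaly_windows(daily_signups, alpha, delta_e, delta_r, eps=1e-9):
    """Return half-open (start, end) day-index windows flagged as anomalous.

    daily_signups: list of daily signup counts for one IP address.
    alpha: EWMA weighting factor; delta_e and delta_r: thresholds on the
    absolute and relative prediction errors.
    """
    windows = []
    n = len(daily_signups)
    if n == 0:
        return windows
    s = float(daily_signups[0])  # assumed initial prediction
    t = 1
    while t < n:
        # Equation (1): S_t = alpha * Y_{t-1} + (1 - alpha) * S_{t-1}
        s = alpha * daily_signups[t - 1] + (1 - alpha) * s
        y = daily_signups[t]
        abs_err = y - s               # E_t
        rel_err = y / max(s, eps)     # R_t
        if abs_err > delta_e and rel_err > delta_r:
            # The window ends once signups drop back to the level
            # observed just before the sudden change.
            baseline = daily_signups[t - 1]
            end = t
            while end < n and daily_signups[end] > baseline:
                end += 1
            windows.append((t, end))
            # Skip the first post-window day so that the next update
            # uses it as Y_{t-1}; s was not updated during the window.
            t = end + 1
        else:
            t += 1
    return windows
```

For the series 2, 2, 2, 50, 60, 2, 2 with alpha = 0.5, delta_e = 10, and delta_r = 4, the sketch reports one window covering days 3 and 4.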
Unraveling the Complexity of Network Management
Abstract


Operator interviews and anecdotal evidence suggest that 
an operator’s ability to manage a network decreases as 
the network becomes more complex. However, there is 
currently no way to systematically quantify how com- 
plex a network’s design is nor how complexity may im- 
pact network management activities. In this paper, we 
develop a suite of complexity models that describe the 
routing design and configuration of a network in a suc- 
cinct fashion, abstracting away details of the underlying 
configuration languages. Our models, and the complex- 
ity metrics arising from them, capture the difficulty of 
configuring control and data plane behaviors on routers. 
They also measure the inherent complexity of the reach- 
ability constraints that a network implements via its rout- 
ing design. Our models simplify network design and 
management by facilitating comparison between alter- 
native designs for a network. We tested our models 
on seven networks, including four university networks 
and three enterprise networks. We validated the results 
through interviews with the operators of five of the net- 
works, and we show that the metrics are predictive of the 
issues operators face when reconfiguring their networks. 


1 Introduction 


Experience has shown that the high complexity underly- 
ing the design and configuration of enterprise networks 
generally leads to significant manual intervention when 
managing networks. While hard data implicating com- 
plexity in network outages is hard to come by, both anec- 
dotal evidence and operator interviews suggest that more 
complex networks are more prone to failures, and are dif- 
ficult to upgrade and manage. 

Today, there is no way to systematically quantify how
complex an enterprise configuration is, and to what extent
complexity impacts key management tasks. Our
experiments show that simple measures of complexity,
such as the number of lines in the configuration files, are
not accurate and do not predict the number of steps that
management tasks require.

In this paper, we develop a family of complexity mod- 
els and metrics that do describe the complexity of the de- 
sign and configuration of an enterprise network in a suc- 
cinct fashion, abstracting away all the details of the un- 
derlying configuration language. We designed the mod- 
els and metrics to have the following characteristics: (1) 
They align with the complexity of the mental model 
operators use when reasoning about their network— 
networks with higher complexity scores are harder for 
operators to manage, change or reason about correctly. 
(2) They can be derived automatically from the config- 
uration files that define a network’s design. This means 
that automatic configuration tools can use the metrics to 
choose between alternative designs when, as frequently 
is the case, there are several ways of implementing any 
given policy. 

The models we present in this paper are targeted to- 
ward the Layer-3 design and configuration of enterprise 
networks. As past work has shown [19], enterprises em- 
ploy diverse and intricate routing designs. Routing de- 
sign is central both to enabling network-wide reacha- 
bility and to limiting the extent of connectivity between 
some parts of a network. 

We focus on modeling three key aspects of routing de- 
sign complexity: (1) the complexity behind configuring 
network routers accurately, (2) the complexity arising 
from identifying and defining distinct roles for routers 
in implementing a network’s policy, and (3) the inherent 
complexity of the policies themselves. 

Referential Complexity. To model the complexity of 
configuring routers correctly, we develop the referential 
dependence graph. This models dependencies in the def- 
initions of routing configuration components, some of 
which may span multiple devices. We analyze the graph 
to measure the average number of reference links per 
router, as well as the number of atomic units of routing 
policy in a network and the references needed to configure
each unit. We argue that the number of steps operators
take when modifying configuration increases
monotonically with these measures.

Router Roles. We identify the implicit roles played by 
routers in implementing a network’s policies. We argue 
that networks become more complex to manage, and up- 
dating configurations becomes more challenging, as the 
number of different roles increases or as routers simul- 
taneously play multiple roles in the network. Our algo- 
rithms automatically identify roles by finding routers that 
share similar configurations. 

Inherent Complexity. We quantify the impact of 
the reachability and access control policies on the net- 
work’s complexity. Networks that attempt to implement 
sophisticated reachability policies, enabling access be- 
tween some sets of hosts while denying it between oth- 
ers, are more complex to engineer and manage than net- 
works with more uniform reachability policies. How- 
ever, a network’s policies cannot be directly read from 
the network’s configuration and are rarely available in 
any other machine-readable form. Our paper explains 
how the complexity of the policies can be automatically 
extracted by extending the concept of reachability sets 
first introduced by Xie et al. [27]. Reachability sets iden- 
tify the set of packets that a collection of network paths 
will allow based on the packet filters, access control rules 
and routing/forwarding configuration in routers on the path.
We compute a measure of the inherent complexity of the 
reachability policies by computing differences or vari- 
ability between reachability sets along different paths in 
the network. We develop algorithms based on firewall 
rule-set optimization to compare reachability sets and to 
efficiently perform set operations on them (such as inter- 
section, union and cardinality). 

We validated our metrics through interviews with the
operators and designers of six networks: four universities
and two commercial enterprises. The questionnaires used in
these interviews can be found online [4]. We also measured
one other network where we did not have access to operators.
Through this empirical study of the complexity of
network designs, we found that we are able to categorize
networks in terms of their complexity using the metrics that
we define. We also find that the metrics are predictive of
issues the operators face in running their networks. The
metrics gave us insights into the structure and function of
the networks that the operators corroborated. A surprising
result of the study was uncovering the reasons why
operators chose the designs they did.

Given the frequency with which configuration errors
are responsible for major outages [22], we argue that
creating techniques to systematically quantify the
complexity of a network’s design is an important first step
toward reducing that complexity. Developing such metrics is
difficult, as they must be automatically computable yet
still enable a direct comparison between networks that
may be very different in terms of their size and routing
design. In databases [14], software engineering [21], and
other fields, metrics and benchmarks have driven the
direction of the field by defining what is desirable and what
is to be avoided. In proposing these metrics, we hope to
start a similar conversation, and we have verified with
operators through both qualitative and quantitative
measures that these metrics capture some of the trickiest parts
of network configuration.


2 Application to Network Management 


Beyond aiding in an empirical understanding of network 
complexity, we believe that our metrics can augment and 
improve key management tasks. We illustrate a few ex- 
amples that are motivated by our observations. 


Understanding network structure: It is common for 
external technical support staff to be brought in when 
a network is experiencing problems or being upgraded. 
These staff must first learn the structure and function of 
the network before they can begin their work, a daunt- 
ing task given the size of many networks and the lack 
of accurate documentation. As we show in Section 7, 
our techniques for measuring reachability have the side- 
effect of identifying routers which play the same role in 
a network’s design. This creates a summary of the net- 
work, since understanding each role is sufficient to un- 
derstand the purpose of all the similar routers. 


Identify inconsistencies: Inconsistency in a network
generally indicates a bug. When most routers fit into a 
small number of roles, but one router is different from 
the others, it probably indicates a configuration or design 
error (especially as routers are often deployed in pairs 
for reasons of redundancy). As we show in Section 6.3, 
when our inherent complexity metric found the reacha- 
bility set to one router to be very different from the set to 
other routers, it pointed out a routing design error. 


What-if analysis: Since our metrics are computed 
from configuration files, and not from a running network, 
proposed changes to the configuration files can be an- 
alyzed before deployment. Should any of the metrics 
change substantially, it is an excellent indication that the 
proposed changes might have unintended consequences 
that should be examined before deployment. 


Guiding and automating network design: Networks 
are constantly evolving as they merge, split, or grow. To- 
day, these changes must be designed by humans using 
their best intuition and design taste. In future work, we 
intend to examine how our complexity metrics can be 
used to direct these design tasks towards simpler designs 
that still meet the objectives of the designer. 
Univ-x [19]   9,000   N
Table 1: Studied networks.


3 Methodology and Background 


Our project began with a review of formal training ma- 
terials for network engineers (e.g., [25, 24]) and inter- 
views with the operators of several networks to under- 
stand the tools and processes they use to manage their 
networks. From these sources, we extracted the “best 
common practices” used to manage the networks. On 
the hypothesis that the use of these practices should be 
discernible in the configuration files of networks that use 
them, we developed models and techniques that tie these 
practices to patterns that can be automatically detected 
and measured in the configurations. 

The remainder of this section describes the networks 
we studied, the best common practices we extracted, and 
a tutorial summary of network configuration in enterprise 
networks. The next sections precisely define our metrics, 
the means for computing them, and their validation. 


3.1 Studied Networks 


We studied a total of seven networks: four university net- 
works and three enterprise networks, as these were the 
networks for which we could obtain configuration files. 
For four of the university networks and two enterprises, 
we were also able to interview the operators of the net- 
work to review some of the results of our analysis and 
validate our techniques. Table 1 shows the key proper- 
ties of the networks. 

Figure 1(a) plots the distribution of configuration file 
sizes for the networks. The networks cluster into three 
groups: Univ-2 and the enterprises consist of relatively 
small files, with 50% of their files being under 500 lines, 
while 90% of the files in Univ-1 and Univ-3 are over 
1,000 lines and Univ-4 has a mix of small and large files. 
As we will see, configuration file size is not a good pre- 
dictor of network complexity, as Univ-2 (small files) is 
among the most complicated networks and Univ-3 (large 
files) among the simplest. 

Figure 1(b) breaks down the lines of configuration by
type. The networks differ significantly in the fraction
of their configurations devoted to packet filters, widely
known as ACLs, and routing stanzas. Univ-1 and the
enterprises spend as many configuration lines on routing
stanzas as on ACLs, while Univ-2, -3 and -4 define
proportionately more ACLs than routing stanzas. Interface
definitions, routing stanzas, and ACL definitions (the key
building blocks for defining layer-3 reachability) account
for over 60% of the configuration in all networks.

Figure 1: (a) Distribution of configuration file size across
networks. (b) Fraction of configuration dedicated to
configuring each aspect of router functionality.

All the networks used some form of tools to maintain 
their configurations [16, 1]. Most tools are home-grown, 
although some commercial products are in use. Most 
had at least spreadsheets used to track inventory, such as 
IP addresses, VLANs, and interfaces. Some used tem- 
plate tools to generate portions of the configuration files 
by instantiating templates using information from the in- 
ventory database. In the sections that follow, we point 
out where tools helped (and sometimes hurt) operators. 


3.2 Network Design and Configuration


Based on our discussions with operators and training ma- 
terials, we extract the best common practices that oper- 
ators follow to make it easier to manage their networks. 
Our complexity metrics quantify how well a network ad- 
heres to these strategies, or equivalently, to what extent a 
network deviates from them. 

Uniformity. Operators attempt to make their networks
as homogeneous as possible. Special cases not only
require more thought and effort to
construct in the first place, but often require special han- 
dling during all future network upgrades. To limit the 
number of special cases operators must cope with, they 
often define a number of archetypal configurations which 
they then reuse any time that special case arises. We call 
these archetypes roles. 

Tiered Structure. Operators often organize their network
devices into tiers to control the complexity of their
design. For example, they define some routers to be border
routers that connect with other networks, some routers
to be core routers that are densely connected, and the
remaining routers to be edge routers that connect hosts.

 1  Interface Vlan901
 2   ip 10.2.1.23 255.255.255.252
 3   ip access-group 9 out
 4  !
 5  Router ospf 1
 6   router-id 10.1.2.133
 7   passive-interface default
 8   no passive-interface Vlan901
 9   network 10.2.0.0 0.0.255.255
10   distribute-list in 11
11   redistribute connected subnets
12  !
13  access-list 9 permit 10.2.1.23 0.0.0.3 any
14  access-list 9 deny any
15  access-list 11 permit 10.2.0.0 0.0.255.255

Figure 2: A sample configuration file.

Short Dependency Chains. Routers cannot be configured
in isolation, as frequently one part of the configuration
will not behave correctly unless other parts of the
configuration, sometimes on other routers, are consistent
with it. We define this to be a dependency between those
configuration lines. Making a change to one configuration
file without updating all the other dependent configurations
will introduce a bug. Since the configurations do not
explicitly declare all their dependencies, operators’ best
strategy is to minimize the number of dependencies.


3.3 Overview of a Configuration File


All our complexity metrics are computed on the basis of 
router configuration files. Before defining the metrics, 
we describe the layout of the configuration file for a net- 
work router and provide an overview of the mechanisms 
(e.g., routing, ACLs and VLANs) used when designing 
enterprise networks. 

The configuration file for a Cisco device consists of 
several types of stanzas (devices from other vendors have 
similar stanza-oriented configurations). A stanza is
defined as the largest contiguous block of commands that
encapsulates a piece of the router’s functionality.

In Figure 2, we show a simple configuration file con- 
sisting of the three most relevant classes of stanzas: inter- 
face in lines 1-3, routing protocol in lines 5-11, and ACL 
in lines 13-15. The behavior exhibited by a router can be 
explained by the interactions between various instances 
of the identified stanzas. 
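
The notion of a stanza can be illustrated with a short sketch that splits a configuration into stanzas. The heuristics are assumptions, not the grammar-based parser described later: a stanza is taken to start at an unindented line, indented lines belong to the stanza above them, and consecutive access-list lines with the same number are grouped. The sample text is adapted from Figure 2.

```python
SAMPLE = """\
interface Vlan901
 ip address 10.2.1.23 255.255.255.252
 ip access-group 9 out
!
router ospf 1
 router-id 10.1.2.133
 passive-interface default
 no passive-interface Vlan901
 network 10.2.0.0 0.0.255.255
 distribute-list in 11
 redistribute connected subnets
!
access-list 9 permit 10.2.1.23 0.0.0.3 any
access-list 9 deny any
access-list 11 permit 10.2.0.0 0.0.255.255
"""


def split_stanzas(text):
    """Split configuration text into stanzas (lists of command lines)."""
    stanzas = []
    for raw in text.splitlines():
        line = raw.rstrip()
        if not line or line.strip() == "!":
            continue  # blank lines and "!" separators carry no commands
        words = line.split()
        if line[0].isspace() and stanzas:
            # Indented sub-command: belongs to the current stanza.
            stanzas[-1].append(line.strip())
        elif (words[0] == "access-list" and stanzas
              and stanzas[-1][0].split()[:2] == words[:2]):
            # Another entry of the same numbered ACL.
            stanzas[-1].append(line)
        else:
            stanzas.append([line])
    return stanzas


STANZAS = split_stanzas(SAMPLE)
```

On this sample the sketch yields four stanzas: one interface stanza, one routing stanza, and two ACL stanzas.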

Egress filtering, i.e., preventing local hosts from sending
traffic with IP addresses that do not belong to them,
has become a popular way to combat IP-address hijacking.
Networks implement egress filtering by defining a
packet filter for each interface and creating a reference to
the appropriate ACL from the interfaces. For example,
line 3 exemplifies the commands an operator would use
to set up the appropriate references.

The purpose of most layer-3 devices is to provide 
network-wide reachability by leveraging layer-3 rout- 
ing protocols. Network-wide reachability can be imple- 
mented by adding a routing stanza and making references 
between that stanza and the appropriate interfaces. Lines
5-11 declare a simple routing stanza with line 8 making
a reference between this routing protocol and the
interface defined earlier. Even in this simple case, the peer
routing protocol stanza on neighboring devices must be
configured consistently with this stanza before routes can
propagate between the devices and through the network.

More complex reachability constraints can be imposed 
by controlling route distribution using ACLs. Line 15 is 
a filter used to control the announcements received from 
the peer routing process on neighboring routers. 

VLANs are widely used to provide fine-grained control
of connectivity, but they can complicate configuration by 
providing an alternate means for packets to travel be- 
tween hosts that is independent of the layer-3 configu- 
ration. In a typical usage scenario, each port on a switch 
is configured as layer-2 or layer-3. For each layer-3 port 
there is an interface stanza describing its properties. Each 
layer-2 port is associated with a VLAN V. The switches 
use trunking and spanning tree protocols to ensure that 
packets received on a layer-2 port belonging to VLAN 
V can be received by every host connected to a port on 
VLAN V on any switch. 

Layer-2 VLANs interact with layer-3 mechanisms via
virtual layer-3 interfaces: an interface stanza not
associated with any physical port but bound to a specific
VLAN (lines 1-3 in Figure 2). Packets “sent” out the
virtual interface are sent out the physical ports belonging
to the VLAN and packets received by the virtual
interface are handled using the layer-3 routing configuration.


4 Reference Chains 


As the above description indicates, enabling the intended 
level of reachability between different parts of a network 
requires establishing reference links in the configuration 
files of devices. Reference links can be of two types: 
those between stanzas in a configuration file (intra-file 
references) and those across stanzas in different config- 
uration files (inter-file). Intra-file references are explic- 
itly stated in the file, e.g. the links in line 8 (Figure 2) 
from a routing stanza to an interface, and in line 10 from 
a routing stanza to an ACL — these must be internally 
consistent to ensure router-local policies (e.g. ingress fil- 
ters and locally attached networks) are correctly imple- 
mented. Inter-file references are created when multiple 
routers refer to the same network object (e.g., a VLAN or 
subnet); these are central to configuring many network- 
wide functions, and crucially, routing and reachability. 
Unlike their intra-file counterparts, not all inter-file
references can be explicitly declared. For example, line 2
refers to a subnet, which is an example of an entity that
cannot be explicitly declared.

As our interviews with operators indicate (§4.3), in
some networks the reference links must be manually es- 
tablished. In other networks, some of the reference links 
within a device are set using automated tools, but many 
of the inter-file references such as trunking a VLAN on 
multiple routers and setting routing protocol adjacencies 
must be managed manually. 

To quantify the complexity of reference links, we first 
construct a referential dependency graph based on device 
configuration files. We compute a set of first-order com- 
plexity metrics which quantify the worst case complexity 
of configuring reference links in the network. Because 
reference links often play a role in implementing some 
network-wide functionality, we also define second order 
metrics that estimate the overall complexity of configur- 
ing such functionality. We focus on routing in this dis- 
cussion, as operators report it is a significant concern. 


4.1 Referential Dependence Graph 


We use a two-step approach to parse configuration files 
and create a dependency graph. 

1. Symbol Table Creation. Router vendor documenta- 
tion typically lists the commands that can appear within 
each configuration stanza and the syntax for the com- 
mands. Based on this, we first create a grammar for con- 
figuration lines in router configuration files. We build a 
simple parser that, using the grammar, identifies “tokens” 
in the configuration file. It records these tokens in a sym- 
bol table along with the stanza in which they were found 
and whether the stanza defined the token or referred to 
it. For example, the access-list definitions in lines 13- 
14 of Figure 2 define the token ACL 9 and line 3 adds a 
reference to ACL 9. 

2. Creating Links. In the linking stage, we create refer- 
ence edges between stanzas within a single file or across 
files based on the entries in the symbol table. We create 
unidirectional links from the stanzas referencing the to- 
kens to the stanza declaring the tokens. Because every 
stanza mentioning a subnet or VLAN is both declaring 
the existence of the subnet or VLAN and referencing the 
subnet/VLAN, we create a separate node in the reference 
graph to represent each subnet/VLAN and create bidirec- 
tional links to it from stanzas that mention it. 
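
The two steps above can be sketched as follows. This is an illustration under stated assumptions rather than the parser used in the study: the symbol-table entries are written by hand for the file in Figure 2 instead of being produced by a grammar, and the stanza labels, the `"shared"` kind, and the helper `build_links` are hypothetical names.

```python
from collections import defaultdict

# Hypothetical symbol-table entries: (stanza, token, kind).
# kind is "def" (stanza defines the token), "ref" (stanza refers to it),
# or "shared" for subnets/VLANs, which every mentioning stanza both
# declares and references.
SYMBOLS = [
    ("interface Vlan901", "interface Vlan901", "def"),       # line 1
    ("interface Vlan901", "subnet 10.2.1.20/30", "shared"),  # line 2
    ("interface Vlan901", "ACL 9", "ref"),                   # line 3
    ("router ospf 1", "interface Vlan901", "ref"),           # line 8
    ("router ospf 1", "ACL 11", "ref"),                      # line 10
    ("access-list 9", "ACL 9", "def"),                       # lines 13-14
    ("access-list 11", "ACL 11", "def"),                     # line 15
]


def build_links(symbols):
    """Return the directed edges (src, dst) of the reference graph."""
    definers = defaultdict(set)
    for stanza, token, kind in symbols:
        if kind == "def":
            definers[token].add(stanza)

    edges = set()
    for stanza, token, kind in symbols:
        if kind == "ref":
            # Unidirectional link: referencing stanza -> declaring stanza.
            for target in definers.get(token, ()):
                edges.add((stanza, target))
        elif kind == "shared":
            # Separate node per subnet/VLAN, linked in both directions.
            edges.add((stanza, token))
            edges.add((token, stanza))
    return edges


EDGES = build_links(SYMBOLS)
```

For this input the graph has three unidirectional reference links and one bidirectional link to the subnet node.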

We also derive maximal sub-graphs relating to
Layer-3 control plane functionality, called “routing
instances” [19]. A routing instance is the collection of
routing processes of the same type on different devices
in a network (e.g., OSPF processes) that are in the
transitive closure of the “adjacent-to” relationship. We derive
these adjacencies by tracing relationships between routing
processes across subnets that are referenced in common
by neighboring routers. Taken together, the routing
instances implement control plane functionality in a
network. In many cases, enterprise networks use multiple
routing instances to achieve better control over route
distribution, and to achieve other administrative goals [19].
For example, some enterprises will place routes to
different departments into different instances, allowing
designers to control reachability by controlling the
instances in which a router participates. Thus, it is
important to understand the complexity of configuring
reference links that create routing instances.
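
One way to picture the derivation of routing instances is as a connected-components computation over routing processes. The sketch below is an assumption-laden illustration: the input format, the router names, and the rule that two processes are adjacent when they have the same protocol type, sit on different devices, and share a subnet are simplifications of the adjacency tracing described above.

```python
from itertools import combinations

# Hypothetical input: (router, protocol) -> subnets covered by that process.
PROCESSES = {
    ("r1", "ospf"): {"10.2.1.20/30", "10.2.2.0/24"},
    ("r2", "ospf"): {"10.2.1.20/30"},
    ("r3", "ospf"): {"10.9.0.0/24"},
    ("r1", "rip"): {"10.3.0.0/24"},
    ("s1", "rip"): {"10.3.0.0/24"},
}


def routing_instances(processes):
    """Group processes by the transitive closure of adjacency (union-find)."""
    parent = {p: p for p in processes}

    def find(p):
        while parent[p] != p:
            parent[p] = parent[parent[p]]  # path halving
            p = parent[p]
        return p

    for a, b in combinations(processes, 2):
        same_type = a[1] == b[1]
        different_device = a[0] != b[0]
        if same_type and different_device and processes[a] & processes[b]:
            parent[find(a)] = find(b)

    groups = {}
    for p in processes:
        groups.setdefault(find(p), set()).add(p)
    return sorted(groups.values(), key=sorted)


INSTANCES = routing_instances(PROCESSES)
```

With this input there are three instances: an OSPF instance spanning r1 and r2, an isolated OSPF process on r3, and a RIP instance between r1 and s1.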


4.2 Complexity Metrics 


We start by capturing the baseline difficulty of creating 
and tracking reference links in the entire network. The 
first metric we propose is the average configuration com- 
plexity, defined as the total number of reference links in 
the dependency graph divided by the number of routers. 
This provides a holistic view of the network. 

We also develop three second-order metrics of the 
complexity of configuring the Layer-3 control plane of 
a network. First, we identify the number of interacting 
routing policy units within the network that the operator 
must track globally. To do this, we count the number of 
distinct routing instances in the entire network. Second, 
we capture the average difficulty of correctly setting each 
routing instance by calculating the average number of 
reference links per instance. Finally, we count the num- 
ber of routing instances each router participates in. In 
all three cases, it follows from the definition of the met- 
rics that higher numbers imply greater complexity for a 
network. 
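
The four metrics can be computed directly once the dependency graph and routing instances are known. The sketch below assumes the link counts and instance memberships have already been extracted; all names and numbers are hypothetical, and reporting the per-router instance count as an average is an assumption.

```python
from statistics import mean


def referential_metrics(num_routers, total_links, links_per_instance, membership):
    """Compute the first-order metric and the three second-order metrics.

    links_per_instance: instance name -> number of reference links in it.
    membership: router name -> set of instances the router participates in.
    """
    return {
        # First-order: reference links in the graph divided by routers.
        "avg_ref_complexity": total_links / num_routers,
        # Second-order: number of distinct routing instances.
        "num_routing_instances": len(links_per_instance),
        # Second-order: average reference links per instance.
        "complexity_per_instance": mean(links_per_instance.values()),
        # Second-order: instances per router, averaged over routers.
        "instances_per_router": mean(len(m) for m in membership.values()),
    }


METRICS = referential_metrics(
    num_routers=4,
    total_links=40,
    links_per_instance={"ospf-global": 12, "rip-r1": 4},
    membership={
        "r1": {"ospf-global", "rip-r1"},
        "r2": {"ospf-global"},
        "r3": {"ospf-global"},
        "r4": {"ospf-global"},
    },
)
```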


4.3 Insights From Operator Interviews


We derived referential complexity metrics for all seven 
networks. Our observations are summarized in Table 2. 
Interestingly, we note that the referential metrics differ
across networks: e.g., they are very low in the case of
Enet-1 and much higher for Univ-1. For five of the seven
networks, we discussed our findings regarding referential 
dependencies with network operators. 

We present the insights we derived focusing on 3 key 
issues: (1) validation: are the referential dependencies 
we inferred correct and relevant in practice (meaning that 
these are links that must be created and maintained for 
consistency and/or correctness)? (2) complexity: are our 
complexity metrics indicative of the amount of difficulty 
operators face in configuring their networks? (3) causes: 
what caused the high referential complexity in the net- 
works (where applicable)? 

Table 2: Complexity due to referential dependence. Columns:
Network; Avg ref complexity per router; Layer-3 functionality
(Num routing instances, Complexity per instance, Instances
per router); Int? Networks where we validated results are
marked with a “Y.”


Validation. We showed each network’s referential de- 
pendence graph to the network operators, along with sub- 
graphs corresponding to the routing protocol configura- 
tion in their network. All operators confirmed that the 
classes of links we derived (e.g., between stanzas of
specific kinds, links within stanzas and across routers) were
relevant. We also gave operators tasks involving changes 
to device configurations (specifically, add or remove a 
specific subnet, apply a new filter to a collection of in- 
terfaces). We verified that our reference links tracked the 
action they took. These two tests, while largely subjec- 
tive, validated our referential dependency derivation. 

As an aside, the dependency graph seems to have sig- 
nificant practical value: Univ-1 and Enet-1 operators felt 
the graph was useful to visualize their networks’ struc- 
ture and identify anomalous configurations. 

Do the metrics reflect complexity? Our second goal 
was to test if the metrics tracked the difficulty of main- 
taining referential links in the network. To evaluate this, 
we gave the operators a baseline task: add a new subnet 
at a randomly chosen router. We measured the number of
steps required and the number of changes made to rout- 
ing configuration. This is summarized below. 


Network   Num steps   Num changes to routing
Univ-3    4           0
Enet-1    1           0





In networks where the metrics are high (Table 2),
operators needed more steps to set up reference links and had
to modify more routing stanzas. Thus, the metrics appear to
capture the difficulty faced by operators in ensuring
consistent device-level and routing-level configuration. We
elaborate on these findings below.

In Univ-1, the operators used a home-grown auto- 
mated tool that generates configuration templates for 
adding a new subnet. Thus, although there are many ref- 
erences to set, automation does help mitigate some as- 
pects of this complexity. 

Adding the subnet required Univ-1’s operator to mod- 
ify routing instances in his network. Just as our second 
order complexity metrics predicted, this took multiple 
steps of manual effort. The operator’s automation tool 
actually made it harder to maintain references needed 
for Layer-3 protocols. Note from Table 2 that an average
Univ-1 router has two routing instances present on
it. These are: a “global” OSPF instance present on all 
core routers and a smaller per-router RIP instance. The 
RIP instance runs between a router and switches directly 
attached to the router, and is used to distribute subnets 
attached to the switches into OSPF. On the other hand, 
OSPF is used to enable global reachability between sub- 
nets and redistribute subnets that are directly attached 
to the router. When a new subnet is added to Univ-1, 
the operator’s tool automatically generates a network 
command and incorporates it directly into the OSPF in- 
stance. When the subnet needs to be attached to a Layer- 
2 switch, however, the network statement needs to be 
incorporated into RIP (and not OSPF). Thus, the operator 
must manually undo the change to OSPF and update the 
RIP instance. Unlike the OSPF instance, the network 
statements in RIP require parameters that are specialized 
to a switch’s location in the network. 

Univ-3 presents a contrast to Univ-1. The operator 
in Univ-3 required 4 steps to add the subnet and this is 
clearly shown by the first order complexity metric for 
Univ-3. In contrast to Univ-1, however, almost all of 
the steps were manual. In another stark difference from
Univ-1, the operator had no changes to make to the routing
configuration. This is because the network used exactly
one routing instance that was set up to redistribute
the entire IP space. This simplicity is reflected in the very
low second order metrics for Univ-3.

The operator in Enet-1 had the simplest job overall. He
had to perform 1 simple step: create an interface stanza
(this was done manually). Again, the routing configuration
required little perturbation.

In general, we found that the metrics are not directly
proportional to the number of steps required to complete
a management task like adding a subnet, but the number
of steps required is monotonically increasing with
referential complexity. For example, Univ-1 with a reference
metric of 41.75 required 4-5 steps to add a subnet. Univ-3,
with a metric of 4.1, needed 4 steps and Enet-1 with a
metric of 1.6 needed just one step.

Causes for high complexity. The most interesting 
part of our interviews was understanding what caused 
the high referential complexity in some networks. The 
reasons varied across networks, but our study highlights 
some of the key underlying factors. 

The first cause we established was the impact of a net- 
work’s evolution over time on complexity. In Univ-1, ap- 
proximately 70% of reference links arose due to “no pas- 
sive interface” statements that attempt to create routing 
adjacencies between neighboring devices. Upon closer 
inspection, we found that a large number of these links 
were actually dangling references, with no correspond- 
ing statement defined at the neighboring router; hence, 
they played no role in the network’s routing functionality.
When questioned, the operator stated that the commands 
were used at one point in time. As the network evolved 
and devices were moved, however, the commands be- 
came irrelevant but were never cleaned up. 

The high second order complexity in Univ-1 results 
from an interesting cause: optimizing for monetary cost
rather than reducing complexity. Univ-1’s operator could 
have used a much smaller number of routing instances 
(e.g. a single network-wide OSPF) with lower referen- 
tial counts to achieve the goal of spreading reachability 
information throughout the network. However, accord- 
ing to the operator, using OSPF on a small number of 
routers, and RIP between switches and routers, was sig- 
nificantly cheaper as OSPF-licensed switches cost more. 
Hence this routing design was adopted although it was 
more complex. 

Sometimes, the policies being implemented may re- 
quire high referential complexity. For instance, Univ-3 
imposes global checks for address spoofing, and there- 
fore applies egress filters on all network interfaces. 
These ACLs accounted for approximately 90% of the 
links in the dependency graph. Similarly, Univ-4 uses 
ACLs extensively, resulting in high referential complexity.
Despite a similar underlying cause, Univ-4 has a
higher complexity value than Univ-3 because it employs
significantly more interfaces and devices.


5 Router Roles 


When creating a network, operators typically start by 
defining a base set of behaviors that will be present 
across all routers and interfaces in the network. They 
then specialize the role of routers and interfaces as 
needed to achieve the objectives for that part of the net- 
work, for example, adding rate shaping to dorm subnets, 
and additional filters to protect administrative subnets. 

Designers often implement these roles using configu- 
ration templates [6]. They create one template for each 
role, and the template specifies the configuration lines 
needed to make the router provide the desired role. Since 
the configuration might need to be varied for each of the 
routers, template systems typically allow the templates to 
contain parameters and fill in the parameters with appro- 
priate values each time the template is used. For exam- 
ple, the template for an egress filter might be as shown in 
Figure 3, where the ACL restricts packets sent by inter- 
face III to those originating from the subnet configured 
to the interface. The designer creates specific configu- 
ration stanzas for a router by concatenating together the 
lines output by the template generator for each behavior 
the router is supposed to implement. 
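
A minimal sketch of template instantiation follows, using the placeholders III, AAA and SSS of the egress-filter template. The helper `instantiate` and the parameter values are hypothetical; the template systems used by operators are not described in enough detail to reproduce them.

```python
import re

TEMPLATE = """\
interface III
 ip access-group 5 in
 ip address AAA SSS
access-list 5 permit AAA SSS
access-list 5 deny any
"""


def instantiate(template, params):
    """Replace each whole-word placeholder with its per-router value."""
    names = "|".join(re.escape(name) for name in params)
    pattern = re.compile(r"\b(" + names + r")\b")
    return pattern.sub(lambda match: params[match.group(1)], template)


# Hypothetical parameter values for one interface on one router.
CONFIG = instantiate(
    TEMPLATE,
    {"III": "Vlan901", "AAA": "10.2.1.23", "SSS": "255.255.255.252"},
)
```

Concatenating the output of several such calls, one per behavior, gives the stanzas for a router.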

  interface III
   ip access-group 5 in
   ip address AAA SSS
  access-list 5 permit AAA SSS
  access-list 5 deny any

Figure 3: Example of a configuration template.

From a complexity stand-point, the more base behaviors
defined within the network, the more work an operator
will have to do to ensure that the behaviors are all
defined and configured correctly and consistently. Further,
the greater the degree of specialization required by
routers to implement a template role, the more complex
it becomes to configure the role.

We show how to work backwards from configurations 
to retrieve the original base behaviors that created them. 
By doing so, we can measure two key aspects of the
difficulty of configuring roles on different routers in a
network: (1) how many distinct roles are defined in the
network? (2) how many routers implement each role?


5.1 Copy-Paste Detection 


We identify roles that are “shared” by multiple routers 
using a copy-paste detection technique. This technique 
looks for similar stanzas on different routers. 

We build the copy-paste detection technique using 
CCFinder [17], a tool that has traditionally been used 
to identify cheating among students by looking for text
or code that has been cut and pasted between their
assignments. We found that CCFinder by itself does not
identify templates of the sort used in router configuration 
(e.g., Figure 3). To discover templates, we automatically 
preprocess every file with generalization. Generalization
replaces the command arguments that may vary with
wildcard entries; for example, IP addresses are replaced
by the string “IPADDRESS”. Our implementation uses
the grammar of the configuration language (Section 4) to 
identify what parameters to replace. 
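
The generalization step can be sketched with a regular expression in place of the configuration grammar. This is a simplification: only dotted-quad IP addresses are generalized, and the pattern and the helper `generalize` are assumptions made for illustration.

```python
import re

# Matches dotted-quad values such as addresses, masks and wildcards.
IPV4 = re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b")


def generalize(line):
    """Replace every IP-address-like argument with the string IPADDRESS."""
    return IPV4.sub("IPADDRESS", line)


# Two hypothetical stanzas from different routers, same template.
ROUTER_A = ["ip address 10.2.1.23 255.255.255.252",
            "access-list 5 permit 10.2.1.23 255.255.255.252"]
ROUTER_B = ["ip address 10.7.4.1 255.255.255.0",
            "access-list 5 permit 10.7.4.1 255.255.255.0"]

GENERAL_A = [generalize(line) for line in ROUTER_A]
GENERAL_B = [generalize(line) for line in ROUTER_B]
```

After generalization the two stanzas are textually identical, so a copy-paste detector can recognize them as instances of one template.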


5.2 Complexity Metrics 


Our first metric is the number of base behaviors defined 
within the network. We define a base behavior as a maxi- 
mal collection of shared-template stanzas that appear to- 
gether on a set of two or more routers. As the number of 
base behaviors increases, the basic complexity of config- 
uring multiple roles across network routers increases. 

To compute the number of base behaviors, we first
identify the shared-template device set of each template:
this is the set of devices on which the configuration
template is present. Next, we coalesce identical sets. To
elaborate, we write the device set for a shared-template
stanza as $ST_i = \{D^i_1, D^i_2, \ldots, D^i_{k_i}\}$, where
each $D^i_j$ represents a router that contains a configuration
stanza generated from shared template $i$. We scan the
shared-template device sets to identify identical sets: if two
shared-template stanzas are present on the same set of
routers, then the stanzas can be considered to have arisen
from a single, larger template; the stanzas are merged and
one of the sets discarded. The final number of distinct
device sets that remain is the number of base behaviors.

Table 3: Roles extracted from ACLs.

As a second order metric, we quantify the uniformity 
among devices in terms of the behaviors defined on them. 
If all devices in the network exhibit the same set of
behaviors (i.e., they all have the same shared-template),
then once an operator understands how one router be- 
haves, it will be easier for him to understand how the 
rest of the routers function. Also, updating the roles is 
simple, as all routers will need the same update. 

To measure uniformity, we compute the median and 
mean numbers of devices in the device sets. We evalu- 
ated other information-theoretic metrics such as entropy. 
However, as our empirical study will show, these simple 
metrics, together with the number of base behaviors, suf- 
fice to characterize the behaviors defined in a network. 
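
The role metrics can be sketched as follows, assuming the shared-template device sets have already been extracted. The template names and router names are hypothetical, and dropping templates found on fewer than two routers follows the definition of a base behavior given above.

```python
from statistics import mean, median

# Hypothetical shared-template device sets: template -> routers carrying it.
DEVICE_SETS = {
    "snmp-acl": {"r1", "r2", "r3", "r4"},
    "redistribution-acl": {"r1", "r2", "r3", "r4"},
    "dept-a-filter": {"r1", "r2"},
    "dept-b-filter": {"r3", "r4", "r5"},
    "lab-acl": {"r5"},  # present on one router only, so not shared
}


def base_behaviors(device_sets):
    """Coalesce templates with identical device sets into base behaviors."""
    merged = {}
    for template, devices in device_sets.items():
        if len(devices) < 2:
            continue  # a base behavior spans two or more routers
        merged.setdefault(frozenset(devices), []).append(template)
    return merged


BEHAVIORS = base_behaviors(DEVICE_SETS)
SIZES = [len(devices) for devices in BEHAVIORS]
NUM_BEHAVIORS = len(BEHAVIORS)
MEDIAN_SIZE = median(SIZES)
MEAN_SIZE = mean(SIZES)
```

In this example the two ACLs present on all four routers merge into one base behavior, leaving three base behaviors in total.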


5.3 Insights from Operator Interviews


Like the referential metrics, we validated our role metrics 
through interviews with five operators. For this discus- 
sion, we focus on the use of ACLs, and Table 3 shows the 
role metrics for each network. We also evaluated roles 
across the entire configuration file, and the results are 
consistent with those for ACLs. 

Validation. When shown the shared templates ex- 
tracted by our system, each of the operators immediately 
recognized them as general roles used in their networks 
and stated that no roles were missed by our technique. 
For example, Univ-1 operators reported seven roles for 
ACLs in their network: one role for an SNMP-related
ACL, one role for an ACL that limits redistribution of 
routes between OSPF and RIP (these first two roles are 
present on most routers) and five ACLs that filter any 
bogus routes that might be advertised by departmental 
networks connected to the university network, one for 
each of the five departments (these are found only on the 
routers where the relevant networks connect). 

Enet-3 has separate templates for sub-networks that 
permit multicast and those that do not, as well as tem- 
plates encoding special restrictions applied to several 
labs and project sub-networks. Enet-1, the network with 
the fewest shared-templates, has a pair of core routers
that share the same set of ACLs. The remaining ACLs in 
the network are specific to the special projects subnets 
that are configured on other routers. Univ-4, the network
with the most shared-templates, has so many roles
because it uses multiple different types of egress filters,
each of which is applied to a subset of the routers. There are
also several special case requests from various departments,
each represented as an ACL applied to 2-3 routers.

Do the metrics reflect complexity? The relationship
between the number of roles and the complexity of the
network is indicated by the type of tools and work processes
used by the operators.

Operators of the network with the fewest roles, Enet-1,
modify all the ACLs in their network manually; they
are able to manage without tools due to the uniformity
of their network. Operators at Univ-1 have tools to
generate ACLs, but not to track relationships between ACLs,
so they push all ACLs to all routers (even those that do
not use the ACL) in an effort to reduce the complexity
of managing their network by increasing the consistency
across the configuration files. (Our shared template system
was programmed to ignore ACLs that are not used by the
router; this explains why the mean device set size is not
larger for Univ-1.) The environment at Univ-3 is similar
to Univ-1, with roughly the same number of ACL roles 
and similar tools that can create ACLs from templates, 
but not track relationships between them. The Univ-3 
operators took the opposite approach to Univ-1, pushing 
each ACL only to the routers that use it, but using man- 
ual process steps to enforce a discipline that each ACL 
contain a comment line listing all the routers where an 
instance of that ACL is found. Operators then rely on 
this meta-data to help them find the other files that need 
to be updated when the ACL is changed. 

Causes for high complexity. In general, the number 
of shared-templates we found in a network directly cor- 
relates with the complexity of the policies the operators 
are trying to realize. For example, Univ-1’s goal of
filtering bogus route announcements from departmental
networks requires applying a control plane filter at each
peering point. Similarly, Univ-4 has policies defining
many different classes of subnets that can be attached to 
the network, each one needing its own type of ACL (e.g., 
egress filtering with broadcast storm control and filtering 
that permits DHCP). There is no way around this type of 
complexity. 

Interestingly, the number of roles found in a network 
appears to be largely independent of the size of the net- 
work. For example, Enet-2 and Enet-3 have the same 
number of roles even though they differ greatly in size. 
Rather, the number of roles seems to stem directly from 
the choices the operators made in designing their net- 
works, and how uniform they chose to make them. 


6 Inherent Complexity


A network’s configuration files can be viewed as the 
“tools” used by network operators to realize a set of 
network-wide reachability policies. These policies
determine whether a network’s users can communicate with
different resources in the network (e.g., other users or
services). The policies that apply to a user could depend on
the user’s “group,” her location, and other attributes. 

The reachability policies fundamentally bound an
operator’s ability to employ simple configurations network-wide.
Consider a network with a “simple” reachability
policy, such as an all-open network that allows any pair
of users to have unfettered communication, or at the
opposite end of the spectrum, a network where all
communication except that to a specific set of servers is shut
off. Such policies can be realized using fairly simple
network configurations. On the other hand, for networks
where the reachability policies are complex, i.e., where 
subtle differences exist between the constraints that ap- 
ply to different sets of users, implementing the policies 
will require complex configuration. 

We develop a framework for quantifying the complex- 
ity of a network’s reachability policies. We refer to this 
as the network’s inherent complexity. We use feedback 
from operators to both validate our metrics and under- 
stand the factors behind the inherent complexity (where 
applicable). Ultimately, we wish to tie inherent complex- 
ity back to the configuration complexity and examine the 
relationship between the two. We discuss this in §6.3.

To derive inherent complexity, we first derive the static 
reachability between network devices, which is the set of 
packets that can be exchanged between the devices. We 
also refer to this as the reachability set for the device pair. 
Our inherent complexity metrics essentially quantify the 
level of uniformity (or the lack of it) in the reachability 
sets for various paths in a network. 


6.1 Reachability Sets 


For simplicity, we assume that network routers have IP 
subnets attached to them, and that each IP address in 
a subnet corresponds to a single host. The reachability 
set for two routers A and C in a network, denoted by
R(A,C), is the set of all IP packets that can originate
from hosts attached to A (if any), traverse the A → C
path, and be delivered to hosts attached at C (if any).
The composition of the reachability sets reflects how 
a network’s policy limits the hosts at a certain net- 
work location from being reachable from hosts at an- 
other network location. At Layer-3, these policies gen- 
erally apply to 5 fields in the packet’s IP header — the 
source/destination addresses, ports and protocol. When
first sent, the source and destination addresses on the
packets could take any of the possible $2^{32}$ values
(the same with ports and the protocol field). Control and
data plane mechanisms on the path might then drop some of
the packets, either because a router on the path lacks a
forwarding entry to that destination or due to packet
filters. R(A,C) identifies the packets that are eventually
delivered to hosts attached to C. Note that the maximum
size of R(A,C) is
$2^{32} \times |C| \times 2^{16} \times 2^{16} \times 2^{8}$,
where |C| is the total number of hosts attached to C.

Figure 4: A toy network with 8 subnets and 5 routers.
The different constituent sets that play a role in the
reachability set for the A → C path are shown.
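
To give a sense of scale for this upper bound, the following
minimal sketch multiplies out the field sizes. It assumes IPv4
(32-bit source addresses), 16-bit source and destination ports,
and an 8-bit protocol field; the function name and the 254-host
example are illustrative and do not come from the paper.

```python
def max_reachability_set_size(num_dest_hosts: int) -> int:
    """Upper bound on |R(A, C)|: every source address, port pair and
    protocol value, combined with each host attached to the destination."""
    source_addresses = 2 ** 32  # any IPv4 source address
    source_ports = 2 ** 16
    dest_ports = 2 ** 16
    protocols = 2 ** 8
    return (source_addresses * num_dest_hosts
            * source_ports * dest_ports * protocols)


# Hypothetical destination router with 254 attached hosts.
bound = max_reachability_set_size(254)
print(f"{bound:.3e}")
```

Even for a single small subnet the bound is far too large to
enumerate packet by packet, which motivates a compact rule-based
representation of the sets.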


6.1.1 Reachability Set Computation 


To compute the reachability sets for a network we consider
three separate yet interacting mechanisms: control-plane
mechanisms (i.e., routing protocols), data-plane
mechanisms (i.e., packet filters), and Layer-2 mechanisms
(such as VLANs).

We compute the reachability sets using the following 
three steps: (1) we first compute valid forwarding paths 
between network devices by simulating routing protocols 
(in the interest of space, we omit the details of routing
simulation; they are in [5]); (2) we calculate the
“per-interface” reachability set on each path, which is the
set of all packets that can enter or leave an interface based 
both on forwarding entries as well as packet filters; and 
(3) we compute reachability sets for end-to-end paths by 
intersecting the reachability sets for interfaces along each 
path. The last two steps are illustrated for a simple toy 
network in Figure 4, and explained in detail below. 

We note that our reachability calculation is similar to 
Xie et al.’s approach for static reachability analysis of IP 
networks [27]. However, our approach differs both in the 
eventual goal and the greater flexibility it provides. Xie 
et al. derive all possible forwarding states for a network 
to study the impact of failures, rerouting, etc. on reacha- 
bility. Because we are interested in examining the inher- 
ent complexity of reachability policies, we focus on the 
computationally simpler problem of computing a single 
valid forwarding state for the network, assuming there 
are no failures. Also, our approach takes into account the
impact of VLANs on reachability within a network (as
described in [5]), which Xie et al. do not. The presence
of VLANs means that routing is effectively a two-step
process: first routing to the VLAN interface, and
then routing through the VLAN to the destination. Our 
calculation tracks which routers trunk which VLANs to 
enable this second step of the routing computation. 

Single interface. The reachability set for interfaces 
on a path is defined as the set of packets that can enter 
or leave an on-path interface (see Figure 4 for examples).
For interfaces that receive packets, this is composed just 
of the set of packets allowed by inbound data plane fil- 
ters. For interfaces which forward packets further along 
a path, this is the union of packets which are permitted 
by outbound filters and packets whose destination IPs are 
reachable from the interface (this depends on the router’s 
forwarding state). 

Path. To compute R(A, C), we first compute the following
supersets: (1) For A, we compute the Entry set, which is the
union of the inbound interface sets for interfaces on A; as
mentioned above, each set is shaped by the inbound filters
on the corresponding interface. (2) For C, we compute the
Exit set, which is the union of the outbound interface sets
for interfaces on C. (3) For intermediate routers, we
compute the intersection of the inbound interface set for
the interface that receives packets from A and the outbound
interface set for the interface that forwards to C. Then,
R(A, C) is simply the intersection of Entry, Exit and the
intermediate sets.
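
The path computation above can be sketched with ordinary Python
sets. This is a toy model only: packets are explicit tuples rather
than the rule-based representation the paper uses, and the function
and variable names are hypothetical.

```python
from typing import Iterable, Set, Tuple

# Toy packet: (source IP, destination IP, destination port).
Packet = Tuple[str, str, int]


def path_reachability(
    entry_interface_sets: Iterable[Set[Packet]],
    intermediate_hops: Iterable[Tuple[Set[Packet], Set[Packet]]],
    exit_interface_sets: Iterable[Set[Packet]],
) -> Set[Packet]:
    """Intersect the Entry set, the Exit set and each intermediate set."""
    # Entry: union of the inbound interface sets on the source router.
    entry = set().union(*entry_interface_sets)
    # Exit: union of the outbound interface sets on the destination router.
    exit_set = set().union(*exit_interface_sets)
    reachable = entry & exit_set
    # Each intermediate router contributes (inbound set) & (outbound set).
    for inbound, outbound in intermediate_hops:
        reachable &= inbound & outbound
    return reachable


web = ("10.0.1.5", "10.0.3.9", 80)
ssh = ("10.0.1.5", "10.0.3.9", 22)
dns = ("10.0.1.6", "10.0.3.9", 53)

result = path_reachability(
    entry_interface_sets=[{web, ssh}, {dns}],
    intermediate_hops=[({web, ssh, dns}, {web, dns})],
    exit_interface_sets=[{web, ssh}],
)
print(result)
```

In this example only the web packet survives: the intermediate
router's outbound set excludes the ssh packet, and the Exit set
excludes the dns packet.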

Some optimizations for efficiency. The above computation
requires us to perform set operations on the interface and
intermediate reachability sets (i.e., set unions and
intersections). These operations could be very
time-consuming (and potentially intractable) because we are
dealing with 5-dimensional reachability sets that could 
have arbitrary overlap with each other. 

To perform these operations efficiently, we convert 
each set into a “normalized” form based on ACL opti- 
mization. Specifically, we represent each reachability set 
as a linear series of rules like those used to define an ACL 
in a router’s configuration, i.e., a sequence of permit and
deny rules that specify attributes of a packet and whether 
packets having those attributes should be allowed or for- 
bidden, where the first matching rule determines the out- 
come. Next, we optimize this ACL representation of the 
sets using techniques that have traditionally been em- 
ployed in firewall rule-set optimization [2, 11]. In the fi- 
nal ACL representation of a reachability set, no two rules 
that make up a set overlap with each other, and we are 
guaranteed to be using the minimal number of such rules 
possible to represent the set. Set operations are easy to 
perform over the normalized ACL representations. For 
instance, to compute the union of two reachability sets 
we merge the rules in the corresponding optimized ACLs
to create one ACL, and then we optimize the resulting
ACL. Intersection can be computed in a similar fashion.

Figure 5: Computing the first and second order metrics
for inherent complexity.
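
The idea of normalizing a first-match rule list into
non-overlapping rules, and then doing set operations on the
normalized form, can be illustrated in one dimension. The sketch
below is a deliberate simplification and not the optimization
algorithm of [2, 11]: it handles a single field (for example, the
destination port) instead of five, assumes a default-deny ACL, and
uses hypothetical function names.

```python
from typing import List, Tuple

Rule = Tuple[str, int, int]   # (action, first value, last value), inclusive
Interval = Tuple[int, int]    # permitted range, inclusive


def normalize(rules: List[Rule], lo: int = 0, hi: int = 65535) -> List[Interval]:
    """Turn a first-match permit/deny list into sorted, disjoint,
    maximally merged permitted intervals (default action: deny)."""
    # Cut the value space at every rule boundary so that each segment
    # lies entirely inside or entirely outside every rule.
    points = sorted({lo, hi + 1}
                    | {start for _, start, _ in rules}
                    | {end + 1 for _, _, end in rules})
    permitted: List[Interval] = []
    for a, b in zip(points, points[1:]):
        if a < lo or b - 1 > hi:
            continue
        action = "deny"
        for act, start, end in rules:          # first matching rule wins
            if start <= a and b - 1 <= end:
                action = act
                break
        if action == "permit":
            if permitted and permitted[-1][1] == a - 1:
                permitted[-1] = (permitted[-1][0], b - 1)  # merge neighbours
            else:
                permitted.append((a, b - 1))
    return permitted


def union(x: List[Interval], y: List[Interval]) -> List[Interval]:
    """Merge the rules of both normalized sets, then re-normalize."""
    return normalize([("permit", s, e) for s, e in x + y])


def intersection(x: List[Interval], y: List[Interval]) -> List[Interval]:
    """Keep only the overlap of each pair of intervals."""
    overlaps = []
    for s1, e1 in x:
        for s2, e2 in y:
            s, e = max(s1, s2), min(e1, e2)
            if s <= e:
                overlaps.append(("permit", s, e))
    return normalize(overlaps)


acl_a = normalize([("deny", 1000, 2000), ("permit", 0, 4000)])
acl_b = normalize([("permit", 500, 2500)])
print(acl_a, union(acl_a, acl_b), intersection(acl_a, acl_b))
```

Because the normalized intervals never overlap, two sets are equal
exactly when their normalized lists are equal, which is the
property the equality and size computations in Section 6.2 rely on.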


6.2 Complexity Metrics 


As stated before, our metrics for inherent complexity 
quantify the similarity, or equivalently, the uniformity, in 
the reachability sets for various end-to-end paths. If the 
sets are uniformly restrictive (reflecting a “default deny” 
network) or uniformly permissive (an all open network), 
then we consider the network to be inherently simple. 
We consider dissimilarities in the reachability sets to be 
indicative of greater inherent complexity. 

First order metric: variations in reachability. To 
measure how uniform reachability is across a network, 
we first compute the reachability set between all pairs of 
routers. We then compute the entropy of the resulting 
distribution of reachability sets and use this value, the 
reachability entropy, as a measure of uniformity. 

Figure 5 summarizes the computation of reachability 
entropy. To compute the distribution of reachability sets 
over which we will compute the entropy, we must count 
how many pairs of routers have the same reachability. 
Intuitively, if there are N routers this involves comparing
$N^2$ reachability sets for equality. To simplify this task,
we compute the reachability set for a pair of routers, turn 
it into optimized ACL form, and then compute a hash 
of the text that represents the optimized set. Identical 
reachability sets have identical hashes, so computing the 
distribution is easy. 

Using the standard information-theoretic definition of 
entropy, the reachability entropy for a network with N 
routers varies from $\log(N)$ in a very simple network
(where the reachability sets between all pairs of routers
are identical) to $\log(N^2)$ in a network where the
reachability set between each pair of routers is different.
We interpret larger values of entropy as indicating the
network’s policies are inherently complex.
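
The hashing and entropy steps described above can be sketched as
follows. Several details are assumptions rather than facts from the
paper: the hash function (SHA-256), the logarithm base (2), and the
ACL strings, which are hypothetical placeholders for the text of an
optimized reachability set. The "uniform" example assumes that sets
differ only by destination router, which is one reading consistent
with the lower bound of log(N).

```python
import hashlib
import math
from collections import Counter
from typing import Dict, Tuple


def reachability_entropy(optimized_acl_text: Dict[Tuple[str, str], str]) -> float:
    """Entropy of the distribution of reachability sets over router pairs.

    Keys are (source router, destination router); values are the text of
    the optimized ACL that represents the pair's reachability set.
    """
    # Identical sets have identical optimized text, hence identical hashes.
    digests = [hashlib.sha256(text.encode("utf-8")).hexdigest()
               for text in optimized_acl_text.values()]
    total = len(digests)
    counts = Counter(digests)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


routers = ["r1", "r2", "r3", "r4"]

# Uniform policy: the set depends only on the destination router.
uniform = {(s, d): f"permit any -> {d}" for s in routers for d in routers}
# Fully specialized policy: every (source, destination) pair is different.
specialized = {(s, d): f"permit {s} -> {d}" for s in routers for d in routers}

print(reachability_entropy(uniform), math.log2(len(routers)))
print(reachability_entropy(specialized), math.log2(len(routers) ** 2))
```

With N = 4 routers the two examples land on the two ends of the
range given in the text, log(N) and log(N^2) in base 2.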

Second order metric: Extent of variations. The entropy
simply tracks whether the reachability sets for network
paths differ, but it does not tell us the extent of the
differences. If the reachability sets had even minute
differences (not necessarily an indication of great
complexity), the entropy could be very high. Thus, entropy
alone may over-estimate the network’s inherent complexity.

To quantify more precisely the variability between the
reachability sets, we examine the similarity between sets
using the approach outlined in Figure 5. Unlike the entropy
calculation, where we examined the $N^2$ reachability sets
between pairs of routers, we examine differences from the
viewpoint of a single destination router (say C). For each
pair of source routers, say A and B, we compute the
similarity metric,
$\mathrm{Sim}(C, A, B) = \frac{|R(A,C) \cap R(B,C)|}{|R(A,C) \cup R(B,C)|}$.

We use the set union and intersection algorithms de- 
scribed in Section 6.1.1 to compute the two terms in the 
above fraction. To compute the sizes of the two sets, 
we first optimize the corresponding ACLs. In an opti- 
mized ACL the rules are non-overlapping, so the number 
of packets permitted by an ACL is the sum of the number 
of packets allowed by the ACL’s permit rules. Since each 
rule defines a hypercube in packet space, the number of 
packets permitted by a rule is found by multiplying out 
the number of values the rule allows on each dimension 
(e.g., address, port). 
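
A minimal sketch of the size and similarity computation follows. It
assumes the similarity is the size of the intersection divided by
the size of the union, as the description of the fraction's two
terms suggests, and it represents each permit rule as a box of
inclusive integer ranges, one per dimension. The two-dimensional
example (destination address range, destination port range) and all
names are illustrative.

```python
from itertools import product
from typing import List, Optional, Tuple

Box = Tuple[Tuple[int, int], ...]  # one inclusive (low, high) range per dimension


def box_size(box: Box) -> int:
    """Packets permitted by one rule: product of the per-dimension counts."""
    size = 1
    for low, high in box:
        size *= high - low + 1
    return size


def box_intersection(a: Box, b: Box) -> Optional[Box]:
    """Overlap of two boxes, or None if they are disjoint."""
    overlap = []
    for (low1, high1), (low2, high2) in zip(a, b):
        low, high = max(low1, low2), min(high1, high2)
        if low > high:
            return None
        overlap.append((low, high))
    return tuple(overlap)


def acl_size(boxes: List[Box]) -> int:
    """Size of an optimized ACL: its rules do not overlap, so sizes add up."""
    return sum(box_size(b) for b in boxes)


def similarity(boxes_a: List[Box], boxes_b: List[Box]) -> float:
    """|A intersect B| / |A union B| for two sets of non-overlapping boxes."""
    # Pairwise overlaps are themselves disjoint, so their sizes add up.
    inter = 0
    for a, b in product(boxes_a, boxes_b):
        overlap = box_intersection(a, b)
        if overlap is not None:
            inter += box_size(overlap)
    union = acl_size(boxes_a) + acl_size(boxes_b) - inter
    return inter / union if union else 1.0


# Hypothetical sets toward the same destination router.
from_a = [((0, 9), (80, 80)), ((0, 9), (443, 443))]   # 20 packets
from_b = [((5, 14), (80, 80))]                        # 10 packets
print(acl_size(from_a), acl_size(from_b), similarity(from_a, from_b))
```

Treating two empty sets as identical (similarity 1.0) is a choice
made for this sketch; the paper does not discuss that case.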

After computing the similarities in this manner, we
cluster source routers that have very similar reachability
sets (we use an inconsistency cutoff of 0.9) [20, p. 1-61].
Finally, we sum the number of clusters found over all
destination routers to compute the number of
per-destination clusters as our second order metric for
inherent complexity. Ideally, this should be N; large values
indicate specialization and imply greater complexity.


6.3 Insights from Operator Interviews


Our study of the configuration complexity in Sections 4 
and 5 showed that some of the networks we studied had 
complex configurations. In this section, we examine 
the inherent complexity of these networks. We validate 
our observations using operator feedback. We also use 
the feedback to understand what caused the complexity. 
(Were the policies truly complex? Was there a bug?) 

Our observations regarding the inherent complexity
for the networks we studied are shown in Table 4.
Interestingly, we see that a majority of the networks
actually had reasonably uniform reachability policies (i.e.,
observed entropy ≈ ideal entropy of $\log(N)$). In other
words, most networks seem to apply inherently simple
policies at Layer-3 and below.

To validate this observation, we checked with the
operators whether the networks were special cases that our
approach somehow missed. We discussed our observations
with the operators of 4 of the 7 networks. The operator
for Enet-1 essentially confirmed that the network did
not impose any constraints at Layer-3 or below and simply
provided universal reachability. All constraints were
imposed by higher-layer mechanisms using middleboxes
such as firewalls.

Figure 6: This figure shows the clusters of routers in
Univ-2 that have similar reachability to the given
destination router. The X axis is the source router ID. The Y
axis is distance between the centers of the clusters.


We turn our attention next to the networks where the 
reachability entropy was slightly higher than ideal (Univ- 
1 and Univ-3). This could arise due to two reasons: either 
the network’s policies make minor distinctions between 
some groups of users creating a handful of special cases 
(this would mean that the policy is actually quite
simple), or there is an anomaly that the operator has missed.


In the case of Univ-3, our interaction with the operator 
pointed to the former reason. A single core router was the 
cause of the deviation in the entropy values. During dis- 
cussions with the operator, we found out that the router 
was home to two unique subnets with restricted access. 


Interestingly, in the case of Univ-1 the slight change
in entropy was introduced by a configuration bug. Upon
discussing with the operator, we found that one of the
routers was not redistributing one of its connected
subnets because a network statement was missing from
a routing stanza on the device. The bug has now been
fixed. This exercise shows how our first and second order
inherent complexity metrics can detect inconsistencies
between an operator’s intent and the implementation
within a network. In networks where the configuration
is complex (Univ-1 is an example, with high referential
counts and many router roles), such inconsistencies are
very hard to detect. However, our complexity metrics
were able to unearth this subtle inconsistency.

We finally discuss networks where the entropy is much
higher than ideal. Of these networks, we were able to speak
to the operator of Univ-2, where both the first and the
second order metrics are very high. In such networks, one
can safely conclude that the policies themselves are
complex. Indeed, Figure 6 examines how similar or different
the reachability is from each of the routers in Univ-2 to
three key routers: CoreA (Figure 6(a)), CoreB (Figure 6(b)),
and Aggregation (Figure 6(c)). For each router C, the
reachability set from every other router to that router is
computed, and the distance between the reachability sets
from routers A and B is computed as 1 - Sim(C, A, B).
A distance of 0 means the sets are identical, and a
distance of 1 means the sets do not overlap. The dendrogram
shows a horizontal line between clusters of routers at the
distance between the centroids of the clusters.

Table 4: Inherent complexity measures.

Interpreting Figure 6, there are 3-5 clusters of routers
that have essentially the same reachability to both CoreA
and CoreB (the only significant difference is that 4, 5, 10
have identical reachability to CoreB, while 4 has slightly
different reachability to CoreA than 5 and 10 do). The
presence of multiple clusters implies that traffic is being
controlled by fine-grained policies. That the clusters
of reachability to the Aggregation router are so different
from those to the core implies that not only are policies
fine-grained, they also differ in different places in the
network. We argue this means the policies are inherently
complex, and that any network implementing them will have a
degree of unavoidable complexity. The operator for Univ-2
agreed with our conclusions.

Applying this analysis to all the networks we stud- 
ied, Table 4 shows the number of per-destination clus- 
ters, that is, the total number of clusters found summing 
across all the routers in the network (second order met- 
ric). This complexity metric confirms that Univ-1 and 
Enet-1 have inherently simple reachability policies. 

However, this metric’s value stems from the information
it provides about networks like Univ-2, Univ-4 and Enet-3.
Enet-3 and Univ-4 both have an entropy value roughly
1.0 higher than ideal. However, Univ-4 has on average
four different clusters of reachability for each router
(85/24), while Enet-3 has two clusters per router (40/19).
This indicates that Enet-3 has reachability sets that are
not identical, but are so similar to each other they
cluster together, while Univ-4 truly has wide disparity in
the reachability between routers. Similarly, Univ-2 has an
entropy metric 1.6 above ideal yet less than two different
clusters per router, indicating that even when reachability
sets are not identical, they are very similar.

Summary of our study. Through interviews with the
operators we have verified the correctness of our
techniques. We show that our metrics capture the difficulty
of adding new functionality such as interfaces, of updating
existing functionality such as ACLs, and of achieving
high-level policies such as restricting user access. In
addition to this, we find that other factors, largely
ignored by previous work (e.g., cost and design), play a
larger role in affecting a network’s complexity than
expected.

NSDI ’09: 6th USENIX Symposium on Networked Systems Design and Implementation

Figure 7: Sink profiles for Univs 1, 2. Network paths for
each device are grouped by the destination router. Panels:
(a) Univ-1, (b) Univ-2. X axis: Network Paths (Grouped by
Destination Router). Y axis: Forwarding Ratio.


7 An Application: Extracting Hierarchy 


In addition to creating a framework for reasoning about 
the complexity of different network designs, complex- 
ity metrics have several practical uses including helping 
operators visualize and understand networks. In this sec- 
tion, we show how our models can discover a network’s 
hierarchy, information that proves invaluable to operators
making changes to the network. 

Many networks are organized into a hierarchy, with 
tiers of routers leading from a core out towards the edges. 
The ability to automatically detect this tiering and clas- 
sify routers to it would be helpful to outside technical 
experts that must quickly understand a network before 
they can render assistance. 

We found that computing the sink ratio for each router
rapidly identifies the tiering structure of a network. The
sink ratio is based on the reachability analysis done on
each path, and measures the fraction of packets that a
router sinks (delivers locally) versus the number it
forwards on. Formally, the sink ratio for a path A → B is
$\frac{|R_{sink}(A,B)|}{|R(A,B)|}$. If the ratio is 1, then
B does not forward traffic from A any further. If not, then
B plays a role in forwarding A’s packets to the rest of the
network.

Figure 7 shows the sink ratio for each path in networks
Univ-1 and Univ-2. Univ-2 contains roughly 4 classes of
devices: the edge (Group 2), the core (Group 1),
intermediate-core (Group 4), and intermediate-edge
(Group 3). Univ-1 consists of a two-layer architecture with
three core routers and nine edge routers, respectively
labeled Group 1 and Group 2. Enet-2 (not shown) has low
forwarding ratios overall: the maximum forwarding ratio
itself is just 0.4 and the minimum is 0.15. Thus, we can
deduce that all routers in Enet-2 play roughly identical
forwarding roles and there is no distinction of core versus
edge routers.


8 Discussion


We now discuss procedural limitations in our approach to 
quantifying complexity as well as some notions of com- 
plexity that we are currently unable to capture. 

Limitations and extensions. Our approach uses 
the static configuration state of the network. Relying 
on static configurations means that operators can use 
our techniques to do “what-if analysis” of configuration 
changes. The downside is that we ignore the effect of 
dynamic events such as link/router failures and load bal- 
ancing, the mechanisms in place to deal with these, and 
the complexity arising from them. It is unclear if our 
approach can be extended easily to account for these. 

Our current work ignores the impact of packet
transformations due to NATs and other middleboxes on
complexity. Packet transformations could alter reachability
sets in interesting ways, and might not be easy to con- 
figure. Fortunately, transformations were not employed 
in any of the networks we studied. We do believe, how- 
ever, it is possible to extend our techniques to account for 
on-path changes to IP headers. 

Of course, our approaches do not account for techniques
employed above Layer-3 or at very low levels. In
particular, we currently do not have an easy way to
quantify the complexity of mechanisms which use
higher-layer handles (e.g., usernames and services) or
lower-layer identifiers such as MAC addresses. One potential
approach could be to leverage dynamic mappings from
the high/low level identifiers to IP addresses (e.g., from
DNS bindings and ARP tables) and then apply the techniques
we used in this paper.

Absolute vs relative configuration complexity. We 
note that our metrics for referential complexity and roles 
capture complexity that is apparent from the current con- 
figuration; hence they are absolute in nature. An increase 
in these metrics indicates growing complexity of im- 
plementation, meaning that configuration-related tasks 
could be harder to conduct. However, the metrics them- 
selves do not reflect how much of the existing configu- 
ration is superfluous, or equivalently, what level of con- 
figuration complexity is actually necessary. For this, we 
would need a relative complexity metric that compares 
the complexity of the existing configuration against the 
simplest configuration necessary to implement the
operators’ goals (including reachability, cost, and other
constraints). However, determining the simplest
configuration that satisfies these requirements is a hard problem
and a subject for future research. 


9 Related Work 


The work most closely related to ours is [18], which
creates a model of route redistribution between routing
instances and tries to quantify the complexity involved in
configuring the redistribution logic in a network. Glue 
Logic and our complexity metrics are similar in that both 
create abstract models of the configuration files and cal- 
culate complexity based on that information. However, 
while [18] limits itself to the configuration complexity of 
route redistribution (the “glue logic”), we examine both 
configuration and inherent complexity, and the relation- 
ship between the two. Our approach also accounts for 
complexity arising from the routing, VLANs and filter- 
ing commands in a configuration file. 

Our study is motivated by [19, 13], which studied
operational networks and observed that the configurations of
enterprise networks are quite intricate. In [19, 13],
models were developed to capture the interaction between
routing stanzas in devices. However, to make inferences 
about the complexity of the networks studied, the authors 
had to manually inspect the models of each network. Our 
work automates the process of quantifying complexity. 

As mentioned in Section 4, we borrow from [19] 
the idea of a routing instance and use it as a way to 
group routing protocols. Also, our referential depen- 
dence graph is similar to the abstractions used in [6, 9]. 
Unlike [6, 9] our abstraction spans beyond the bound- 
aries of a single device, which allows us to define the 
complexity of network-wide configuration. 

Several past studies such as [12, 10, 28, 26, 27] have 
considered how network objectives and operational pat- 
terns can be mined from configuration files. Of these, 
some studies [28, 26, 27] calculate the reachability sets 
and argue for their usage in verifying policy compliance. 
In contrast, the group of complexity metrics we provide 
allow operators to not only verify policy compliance, but 
they also quantify the impact of policy decisions on the 
ability to achieve a simple network-wide configuration. 
Complementary to [10], which proposes high-level con- 
straints that if met ensure the correctness of routing, we 
start with the assumption that the network is correct and 
then derive its properties. 

Contrary to the “bottom-up” approach we take, several 
studies [8, 15, 3] have considered how to make network 
management simpler by building inherent support for the 
creation and management of network policies. We pre- 
sume that our study of configuration and inherent com- 
plexity can inform such ideas on clean slate alternatives. 
Finally, our metrics could be easily integrated into exist- 
ing configuration management tools such as AANTS [1] 
and OpenView [16], and can aid operators in making in- 
formed changes to their network configurations. 

The notion of “complexity” has been explored in do- 
mains such as System Operations [7]. In [7], complex- 
ity is defined as the number of steps taken to perform 
a task, similar to our metrics. Recently, Ratnasamy has 
proposed that protocol complexity be used in addition
to efficiency to compare network protocols [23]. Just as
Ratnasamy’s metrics help choose the right protocol, our 
metrics help pick the right network design. 


10 Conclusions 


Configuration errors are responsible for a large fraction 
of network outages, and we argue that as networks be- 
come more complex the risk of configuration error in- 
creases. This paper takes the first step towards quantify- 
ing the types of complexity that lead operators to make 
configuration mistakes. Creating such metrics is difficult 
as they must abstract away all non-essential aspects of 
network configuration to enable the meaningful compar- 
ison of networks with very different sizes and designs. 

In this paper, we define three metrics that measure the 
complexity of a network by automatic analysis of its con- 
figuration files. We validate the metrics’ accuracy and 
utility through interviews with the network operators. 
For example, we show networks with higher complex- 
ity scores require more steps to carry out common man- 
agement tasks and require more tools or more process 
discipline to maintain. Our study also generated insights 
on the causes of complexity in enterprise networks, such 
as the impact of the cost of network devices on routing 
design choices and the effect of defining multiple classes 
of subnets and multiple device roles. 

We believe our metrics are useful in their own right, 
and we show how they can aid with finding configuration 
errors and understanding a network’s design. However, 
our hope is that these metrics start a larger discussion 
on quantifying the factors that affect network complexity 
and management errors. The definition of good metrics 
can drive the field forward toward management systems 
and routing designs that are less complex and less likely 
to lead human operators into making errors. 

Acknowledgements. We would like to thank our 
shepherd, Kobus van der Merwe, and the reviewers for 
their useful feedback. We would also like to thank Dale 
Carder, Perry Brunelli, and the other operators for their 
network configuration files. This work was supported in 
part by an NSF CAREER Award (CNS-0746531) and an 
NSF NeTS FIND Award (CNS-0626889). 


References 


[1] Authorized Agent Network Tool Suite (AANTS). 
http://www.doit.wisc.edu/network/upgrade/faq/aants.asp. 


[2] ACHARYA, S., WANG, J., GE, Z., ZNATI, T., AND GREEN- 
BERG, A. Simulation study of firewalls to aid improved perfor- 
mance. In ANSS ’06. 


[3] BALLANI, H., AND FRANCIS, P. CONMan: A Step towards 
Network Manageability. In Proc. of ACM SIGCOMM (2007). 


[4] BENSON, T., AKELLA, A., AND MALTZ, D. A. Operator
questionnaire. http://pages.cs.wisc.edu/~tbenson/questionnaire.html.


[5] BENSON, T., AKELLA, A., AND MALTZ, D. A. A case for
complexity models in network design and management. Tech.
Rep. 1643, UW Madison, August 2008.

[6] CALDWELL, D., GILBERT, A., GOTTLIEB, J., GREENBERG,
A., HJALMTYSSON, G., AND REXFORD, J. The cutting EDGE
of IP router configuration. In HotNets (2003).

[7] CANDEA, G. Toward quantifying system manageability. In
HotDep (2008), USENIX Association.

[8] CASADO, M., FRIEDMAN, M., PETTITT, J., MCKEOWN, N.,
AND SHENKER, S. Ethane: Taking Control of the Enterprise. In
SIGCOMM ’07.

[9] CHEN, X., MAO, Z. M., AND VAN DER MERWE, J. Towards
automated network management: network operations using
dynamic views. In INM ’07.

[10] FEAMSTER, N. Rethinking routing configuration: Beyond
stimulus-response reasoning. In WIRED (Oct ’03).

[11] FELDMANN, A., AND MUTHUKRISHNAN, S. Tradeoffs for
packet classification. In INFOCOM 2000.

[12] FELDMANN, A., AND REXFORD, J. IP network configuration
for intradomain traffic engineering. Network, IEEE 15 (Sep ’01).

[13] GARIMELLA, P., SUNG, Y.-W. E., ZHANG, N., AND RAO, S.
Characterizing VLAN usage in an operational network. In INM
’07.

[14] GRAY, J., Ed. The Benchmark Handbook for Database and
Transaction Processing Systems. Morgan Kaufmann, 1991.

[15] GREENBERG, A., HJALMTYSSON, G., MALTZ, D. A., MYERS,
A., REXFORD, J., XIE, G., YAN, H., ZHAN, J., AND ZHANG,
H. A Clean Slate 4D Approach to Network Control and
Management. ACM Sigcomm CCR (2005).

[16] HEWLETT-PACKARD. Enterprise Management Software: HP
OpenView. http://h20229.www2.hp.com/.

[17] KAMIYA, T., KUSUMOTO, S., AND INOUE, K. CCFinder: a
multilinguistic token-based code clone detection system for large
scale source code. IEEE Trans. Softw. Eng. 28, 7 (2002).

[18] LE, F., XIE, G. G., PEI, D., WANG, J., AND ZHANG, H.
Shedding light on the glue logic of the Internet routing
architecture. In SIGCOMM (2008).

[19] MALTZ, D. A., ZHAN, J., XIE, G., HJALMTYSSON, G.,
GREENBERG, A., AND ZHANG, H. Routing Design in
Operational Networks: A Look from the Inside. In SIGCOMM (2004).

[20] MATHWORKS. Statistics Toolbox for Use with MATLAB, 1999.

[21] MCCABE, T., AND BUTLER, C. Design Complexity
Measurement and Testing. Communications of the ACM 32, 12 (1989).

[22] OPPENHEIMER, D., GANAPATHI, A., AND PATTERSON, D. A.
Why do Internet services fail, and what can be done about it? In
USITS (2003).

[23] RATNASAMY, S. Capturing Complexity in Networked Systems
Design: The Case for Improved Metrics. In HotNets (2006).

[24] RYBACZYK, P. Network Design Solutions for Small-Medium
Businesses. Cisco, 2004.

[25] THOMAS, T., AND KHAN, A. Network Design and Case Studies
(CCIE Fundamentals). Cisco, 1999.

[26] WONG, E. W. W. Validating network security policies via static
analysis of router ACL configuration. Master’s thesis, Naval
Postgraduate School (U.S.), 2006.

[27] XIE, G., ZHAN, J., MALTZ, D. A., ZHANG, H., GREENBERG,
A., HJALMTYSSON, G., AND REXFORD, J. On static
reachability analysis of IP networks. In Proc. IEEE INFOCOM (2005).

[28] ZHANG, B., NG, T. S. E., AND WANG, G. Reachability
monitoring and verification in enterprise networks. In SIGCOMM
Poster (Nov. 2008).



NetPrints: Diagnosing Home Network Misconfigurations 
Using Shared Knowledge 


Bhavish Aggarwal, Ranjita Bhagwan, Tathagata Das,
Siddharth Eswaran, Venkata N. Padmanabhan, and Geoffrey M. Voelker


*Microsoft Research India 


Abstract 


Networks and networked applications depend on sev- 
eral pieces of configuration information to operate cor- 
rectly. Such information resides in routers, firewalls, 
and end hosts, among other places. Incorrect informa- 
tion, or misconfiguration, could interfere with the run- 
ning of networked applications. This problem is particu- 
larly acute in consumer settings such as home networks, 
where there is a huge diversity of network elements and 
applications coupled with the absence of network ad- 
ministrators. 

To address this problem, we present NetPrints, a sys- 
tem that leverages shared knowledge in a population of 
users to diagnose and resolve misconfigurations. Basi- 
cally, if a user has a working network configuration for 
an application or has determined how to rectify a prob- 
lem, we would like this knowledge to be made available 
automatically to another user who is experiencing the 
same problem. NetPrints accomplishes this task by ap- 
plying decision tree based learning on working and non- 
working configuration snapshots and by using network 
traffic based problem signatures to index into configura- 
tion changes made by users to fix problems. We de- 
scribe the design and implementation of NetPrints, and 
demonstrate its effectiveness in diagnosing a variety of 
home networking problems reported by users. 


1 Introduction 


A typical network comprises several components, in- 
cluding routers, firewalls, NATs, DHCP, DNS, servers, 
and clients. Configuration information residing in each 
component controls its behaviour. For example, a fire- 
wall’s configuration tells it which traffic to block and 
which to let through. Correctness of the configuration 
information is thus critical to the proper functioning of 
the network and of networked applications. Misconfigu- 
ration interferes with the running of these applications. 
This problem is particularly acute in consumer settings
such as home networks given the huge diversity in
network elements and applications which are deployed
without the benefit of vetting and standardization that 
is typical of enterprises. An application running in the 
home may experience a networking problem because of 
a misconfiguration on the local host or the home router, 
or even on the remote host/router that the application at- 
tempts to communicate with. Worse still, the problem 
could be caused by the interaction of various configura- 
tion settings on these network components. Table 1 il- 
lustrates this point by showing a set of typical problems 
faced by home users. Owing to the myriad problems 
that home users can face, they are often left helpless, not 
knowing which, if any, of a large set of configuration 
settings to manipulate. 


Nevertheless, it is often the case that another user has 
a working network configuration for the same applica- 
tion or has found a fix for the same problem. Moti- 
vated by this observation, we present NetPrints (short 
for Network Problem Fingerprints), a system that helps 
users diagnose network misconfigurations by leveraging 
the knowledge accumulated by a population of users. 
This approach is akin to how users today scour through 
online discussion forums looking for a solution to their 
problem. However, a key distinction is that the accu- 
mulation, indexing, and retrieval of shared knowledge in 
NetPrints happens automatically, with little human in- 
volvement. 





NetPrints comprises client and server components. 
The client component, which runs on end hosts such as 
home PCs, gathers configuration information pertaining 
to the local host and network configuration, and possibly 
also the remote host and network that the client applica- 
tion is attempting to communicate with. In addition, it 
captures a trace of the network traffic associated with an 
application run and extracts a feature vector that charac- 
terizes the corresponding network communication. The 
client uploads this information to the NetPrints server 
at various times, including when the user encounters a 
problem and initiates diagnosis. We enlist the user’s help 
in a minimally intrusive manner to have the uploaded 
information labeled as “good” or “bad”, depending on
whether the corresponding application run was successful
or not.

The NetPrints server performs decision tree based
learning on the labeled configuration information sub- 
mitted by clients to construct a configuration tree, which 
encodes its knowledge of the configuration settings that 
work and ones that do not. Furthermore, it uses the la- 
beled network feature vectors to learn a set of signatures 
that help distinguish among different modes of failure of 
an application. These signatures are used to index into a 
set of change trees, which are constructed using config- 
uration snapshots gathered before and after a configura- 
tion change was made to fix a problem. At the time of 
diagnosis, given the suspect configuration information 
from the client, the NetPrints server uses a configuration 
mutation algorithm to automatically suggest fixes back 
to the user. 

We have prototyped the NetPrints system on Win- 
dows Vista and made a small-scale deployment on 4 
broadband-connected PCs. We present a list of 21 
configuration-related home networking problems and 
their resolutions from online discussion boards, user sur- 
veys, and our own experience. We believe that all of 
these problems and others similar to them can be diag- 
nosed and fixed by NetPrints. We were able to obtain the 
necessary resources to reproduce 8 of these problems for 
4 applications in our small deployment and also our lab- 
oratory testbed. Since we do not have configuration data 
or network traces from a large population of users, we 
perform learning on real data gathered for the applica- 
tions run in our testbed, where we artificially vary the 
network configuration settings to mimic real-world di- 
versity of configurations. Our evaluation demonstrates 
the effectiveness and robustness of NetPrints even in the 
face of mislabeled data. 

Our focus in this paper is on the diagnostics aspects 
of NetPrints. We are doing separate work on the pri- 
vacy, data integrity, and incentives aspects as well but 
do not discuss these here. Also, our focus here is on 
network configuration problems that interfere with spe- 
cific applications but do not result in full disconnection 
and, in particular, do not prevent communication with 
the NetPrints server. Indeed, these subtle problems tend 
to be much more challenging to diagnose than basic con- 
nectivity problems such as full disconnection. In future 
work, we plan to investigate the use of out-of-band com- 
munication (e.g., via a physical medium) to enable Net- 
Prints diagnosis even with full disconnection. 


2 Related Work 

We discuss prior work on problem diagnosis in computer 
systems and in networks, and how NetPrints relates to it.


2.1 Peer Comparison-based Diagnosis


There has been prior work on leveraging shared knowledge
across end hosts, which provides inspiration for a
similar approach in NetPrints. However, the prior work
differs from NetPrints in significant ways. 

Strider [19] uses a state-based black-box approach for 
diagnosing Windows registry problems by performing 
temporal and spatial comparisons with respect to known 
healthy states. It assumes the ability to explicitly trace 
what configuration information is accessed by an appli- 
cation run. Such state tracing would be difficult to do 
with network configuration, which governs policy (e.g., 
port-based filtering) that implicitly impacts an applica- 
tion’s network communication rather than being explic- 
itly accessed by applications. 

PeerPressure [18] extends Strider by eliminating the 
need to identify a single healthy machine for compari- 
son. Instead, it relies on registry settings from a large 
population of machines, under the assumption that most 
of these are correct. It then uses Bayesian estimation 
to produce a rank-ordered list of the individual registry 
key settings presumed to be the culprits. While this un- 
supervised approach has the advantage of not requiring 
the samples to be labeled, it also means that PeerPres- 
sure will necessarily find a “culprit”, even when there 
is none. This outcome might not be appropriate in a 
networking setting, where a problem might be unrelated 
to client configuration. Also, PeerPressure is unable to 
identify combinations of configuration settings that are 
problematic. 

Finally, Autobash [15] helps diagnose and recover
from system configuration errors by recording the user
actions that fix a problem on one computer and then
replaying and testing these on another computer that is
experiencing the same problem. Autobash assumes
support for causality tracking between configuration
settings and the output, which is akin to state tracing in
Strider discussed above.


2.2 Problem Signature Construction 


There has been work on developing compact signatures 
for systems problems for use in indexing a database of 
known problems and their solutions. 

Yuan et al. [21] generate problem signatures by 
recording system call traces, representing these as 
n-grams, and then applying support vector machine 
(SVM) based classification. Cohen et al. [8,9] con- 
sider the problem of automated performance diagnosis 
in server systems. They use Tree-Augmented Bayesian 
Networks (TANs) to identify combinations of low-level
system metrics (e.g., CPU usage) that correlate well with 
high-level service metrics (e.g., average response time). 

In contrast, NetPrints uses a set of network traffic
features, which we have picked based on our networking
domain knowledge, to construct problem signatures.
Since these network traffic features tend to be
OS-independent, NetPrints would be in a position to share
signatures across OSes. Furthermore, we use a decision
tree based classifier to learn the signatures.


2.3 Network Problem Diagnosis


Active probing is widely used for diagnosing network 
problems. For example, Tulip [12] probes routers to 
localize anomalies such as packet reordering and loss. 
Such diagnosis relies on a model of how network ele- 
ments such as routers operate. Likewise, several model- 
or rule-based engines have been developed for diag- 
nosing configuration-related and other faults in wireless 
LANs. These include systems that rely on infrastructure- 
based monitoring (e.g., DAIR [5], Jigsaw [7]) and those 
that rely on cooperation among wireless clients (e.g., 
WiFiProfiler [6]). 


Other diagnosis systems such as SCORE [11] and 
Sherlock [4] have modeled, and in some cases automat- 
ically discovered, dependencies between higher-layer, 
observable network events and the underlying network 
components. Formal methods have also been used to 
check the correctness of network configurations. For ex- 
ample, rcc [10] checks for a range of well-understood 
BGP properties. 


In the context of NetPrints, it may be possible to con- 
struct such models for certain well-understood configu- 
ration settings (e.g., port-based filters), thereby allowing 
diagnosis based on active probing, rules, or formal meth- 
ods. However, in general, configuration settings may 
not be documented or well-understood, hence NetPrints’ 
black-box approach. 


2.4 NetPrints Compared to Prior Work 


We view NetPrints as being complementary to prior 
work on network diagnosis in two ways. First, NetPrints 
focuses on configuration problems that impact specific 
applications rather than on broad problems that impact 
the network infrastructure. Second, NetPrints uses a 
black-box approach appropriate for arbitrary and poorly
understood configuration information, avoiding the need 
for the network behaviour or dependencies to be mod- 
eled explicitly. 


NetPrints draws inspiration from prior work on black- 
box techniques to diagnose systems problems and index 
them with signatures to enable recall. However, Net- 
Prints’ goal of identifying how to mutate a broken con- 
figuration to fix a problem leads us to use a different ap- 
proach — decision tree based learning — compared to 
prior work. This is primarily because of the interpretable 
nature of a decision tree. Furthermore, NetPrints lever- 
ages domain-specific knowledge to construct signatures 
of networking problems. The diagnosis procedure in 
NetPrints is both state-based and signature-based. 


Figure 1: NetPrints system design. [Diagram labels: NetPrints
Server (Signature Generator; Configuration Manager; signatures,
change trees) and NetPrints Client.]


3 Overview of NetPrints Design 


We begin with an overview of NetPrints, before turning 
to a more detailed discussion in the sections that follow. 
Figure 1 depicts the client and server components
of NetPrints, and their interaction. NetPrints has two 
modes of operation: “construction” and “diagnosis”. 

In the construction mode, the NetPrints server gath- 
ers configuration snapshots (Section 4) and network traf- 
fic features from NetPrints clients. This information 
is labeled as “good” or “bad” depending on whether 
the application run was successful or not. The Net- 
Prints server, using this information, constructs a con- 
figuration tree (Section 5) that encodes its knowledge 
of which configuration settings work. It constructs a 
change tree (Section 7) based on the before and after 
snapshots of configuration changes that fixed a problem. 
Change trees are indexed by network traffic signatures 
(Section 6) that characterize how an application run fails. 
All these are constructed on a per-application basis. 

When users experience a problem with an applica- 
tion, they invoke the diagnosis procedure. The Net- 
Prints client, which runs on the user’s machine, identi- 
fies which application to diagnose, either automatically 
(e.g., the application that last had focus) or with the help 
of the user. The client then gathers and uploads local 
configuration information and network traffic features, 
both labeled as “bad”, to the NetPrints server (step 1 in 
Figure 1). 

The NetPrints server performs diagnosis in two 
phases. In phase I, it uses the application-specific con- 
figuration tree to determine whether the client’s configu- 
ration is problematic and, if so, identifies remedial con- 
figuration mutations, which it then conveys to the client 
(step 2 in Figure 1). 

While configuration tree based diagnosis would work
in many cases, it might fail, for instance, when there are
“hidden” configuration parameters that impact a subset
of the clients, so that the main configuration tree does
not find anything amiss with the configuration of such
clients (e.g., #4, #8, #10, and #12 in Table 1; see
Section 7 for an elaboration of #8). So in phase II, the
NetPrints server uses a signature of the application problem
to identify the appropriate change tree, which has been
constructed by focusing specifically on such problematic
cases. If the change tree is unable to diagnose the
problem either, NetPrints gives up; it is possible that the
problem is not configuration-related.


# | Application | Router | Problem | Cause | Resolution
------------------------------------------------------------
1 | | WGR614 | VPN client does not connect | Stateful firewall was off | Turn on the stateful firewall
2 | VPN | WRT54G | VPN drops connection after 3 minutes | | Set MTU to 1350-1400, uncheck “block anonymous internet request”, “filter multicast boxes” in router configuration
3 | VPN | WRT54G | No VPN connectivity | No PPTP passthrough | Turn on PPTP passthrough
4 | VPN | WRT54G | No VPN connectivity | Double NAT, second NAT was [...] | Switch from PPTP server to [...]
5 | File Sharing | | Unidirectional sharing | End-host firewall is not properly configured | Allow file sharing through all [...]
6 | File Sharing | WGR614v5 | No file sharing | Client machine is on a domain, server machine is on workgroup | Put both machines either on the same domain or workgroup
7 | | | Cannot connect to FTP server [...] | Port forwarding incorrect | Turn on port forwarding on [...]
8 | FTP | WGR614 | Cannot connect to FTP server | Client firewall blocking traffic | Turn on firewall rule to allow [...]
9 | VPN server | WRT54G | PPTP server behind NAT does not work despite port forwarding and PPTP passthrough allowed | IP of server is 192.168.1.109, which is inside default DHCP range of router; router’s port forward to IPs inside default range of router does not work | Use static IP outside DHCP range for server
10 | Outlook | WRT54G | Outlook does not connect via VPN to office | Default IP range of router was same as that of the remote router | Change the IP range of home router
11 | Outlook | WGR614 | Router not able to email logs | SMTP server not configured properly | Set up SMTP server details in router configuration
12 | Outlook | Linksys | Not able to send mail through Linksys router; Belkin router works fine | MTU value too high for remote router, so remote router discards packets | Reduce MTU to 1458 or 1365
13 | SSH | WGR614 | SSH client times out after 10 [...] | NAT table entry times out | Change router or increase [...]
14 | Office | WRTP54G | IM client does not connect to office | DNS requests not resolved | Turn off DNS proxy on router
15 | STEAM games | WGR614 | Listing game servers causes connection drops | Router misinterprets the sudden influx of data as an attack and drops connection | Upgrade to latest firmware
16 | Real-[...] | BEFW11s4 | Streaming kills router | Firmware upgrade caused [...] | Downgrade to previous [...]
17 | Xbox | WRT54G | Xbox does not connect and all games do not run | Some ports are blocked and NAT traversal is restricted | Set static IP address on Xbox and configure it as DMZ, enable port forwarding on UDP 88, TCP 3074 and UDP 3074, disable UPnP to open NAT
18 | Xbox | WRT54G | Xbox works with wired network but not with wireless | WPA2 security is not supported | Change wireless security feature from WPA2 to WPA per-[...]
20 | | DG834GT | Camera disconnects periodically at midnight, router needs reboot | DHCP problem | Configure static IP on the camera
21 | ROKU | DIR-655 | ROKU did not work with mixed b, g and n wireless modes | (n/a) | Change to mixed b and g mode

Table 1: Recent configuration-related problems in home networks.


4 Configuration Scraper 


The configuration scraper gathers configuration infor- 
mation from the local Internet Gateway Device (IGD) 
— which we loosely refer to as the local router — the 
local client host, and possibly also from a remote host 
and network. 


4.1 Internet Gateway Configuration 


The scraper gathers two categories of IGD information: 
(1) IGD identification information: This information in- 
cludes the make, model and firmware version of the de- 
vice, which in most cases is a home router, although in 
some cases it could be a DSL or cable modem. The 
scraper obtains this information using the UPnP inter- 
face which is supported and enabled by default on most 
modern IGDs [16]. UPnP is a standard with which our 
client can obtain basic information such as the URL 
for the Web interface for the device, and the make and 
model of the device. However, if the router has UPnP 
turned off, we ask the user to manually input the IGD 
identification information. Note that the user will need 
to input this information only very rarely, i.e., when they 
install a new router that has UPnP turned off. 

(2) Network-specific configuration information: The
IGD also includes configuration information such as 
port forwarding and triggering tables, MTU value, VPN 
pass-through parameters, DMZ settings, and wireless se- 
curity settings. The scraper uses both the UPnP interface 
and the Web interface that most routers and modems pro- 
vide to glean such configuration information. On some 
of the routers we tested, the port tables from the Web 
page and the port tables from the UPnP interface were 
not kept consistent with each other. Consequently, we 
scrape and combine the tables via both interfaces. Some 
router firmware versions also allow us to scrape the max- 
imum NAT table size and the per-connection timeout for 
each table entry. These fields can be particularly useful 
in diagnosing problems such as #2 and #13 in Table 1. 

While the UPnP interface gives us access to only 
device-identifying parameters and the UPnP port for- 
warding and port triggering tables, the Web interface is 
richer but not standardized across routers. 

In particular, there is no standardized way of
parsing the HTML to extract the (key,value) pairs
defining the configuration. To address this problem,
we make the observation that each configuration Web
page of the device is typically an HTML form that
includes a “submit” operation. We invoke this operation
programmatically on each configuration Web
page. Doing so causes the creation of an HTTP POST
request containing all of the (key,value) pairs in an
easy-to-parse form. For example, the body of the POST
request might contain:
submit_button=index&dhcp_start=100&dhcp_num=50&dhcp_lease=1440.
It is then straightforward to extract the various
DHCP-related configuration settings from this string.
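
As a minimal sketch of this extraction step, the POST body can be split with a standard form-decoding routine. The helper name parse_post_body is hypothetical and is not part of NetPrints; only the example string comes from the text above.

```python
from urllib.parse import parse_qs


def parse_post_body(body: str) -> dict[str, str]:
    """Split a form-encoded POST body into a flat {key: value} dict."""
    # keep_blank_values=True retains settings whose value is empty.
    parsed = parse_qs(body, keep_blank_values=True)
    # parse_qs returns a list per key; keep the last value if a key repeats.
    return {key: values[-1] for key, values in parsed.items()}


# Example body taken from the text; the dhcp_ prefix filter is an assumption
# about how one might pick out the DHCP-related settings.
body = "submit_button=index&dhcp_start=100&dhcp_num=50&dhcp_lease=1440"
settings = parse_post_body(body)
dhcp_settings = {k: v for k, v in settings.items() if k.startswith("dhcp_")}
print(dhcp_settings)
```

All values are kept as strings here, since the scraper need not know which settings are numeric.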

While scraping Web forms, the NetPrints client asks 
for the user name and password set on the router. The 
user will need to input this information once, after which 
a cookie within the NetPrints client will remember the 
input to use every time it scrapes the Web interface of 
the router. Note that no such information is needed for 
the UPnP-based scraping. 


4.2 Local Host Configuration 


There is also much configuration information of rele- 
vance to network operation on the local client host it- 
self, such as whether the network connection is wired or 
wireless, whether TCP window scaling is on or off, and 
end-host firewall rules. We currently scrape all interface- 
specific network parameters, TCP-specific parameters 
and firewall rules from the end-host. Our implementa- 
tion uses the netsh utility available on Windows oper- 
ating systems to get this information. 


4.3 Remote Configuration


In general, the configuration of the remote host and net- 
work also impacts the health of network applications. 
In some cases, the configuration information at the re- 
mote end may be inaccessible to us (e.g., the remote 
host might be a server in a different administrative do- 
main). In other cases, however, the remote host might 
be under the control of the same user as the local host. 
One example is communication between a client and a 
server on the same home network, say as part of a file or 
printer sharing application. Another example is when a 
user tries to access a service running in their home net- 
work from an external location, such as a user in their 
workplace accessing their home FTP server. 

If the user installs the NetPrints client on the remote 
host as well, then, using simple password-based authen- 
tication, the local NetPrints client can obtain remote host 
and network configuration information. For every ap- 
plication, the NetPrints client keeps track of all remote 
hosts that it accesses or tries to access and, if the re- 
mote site runs NetPrints under the same administration 
as the local NetPrints client, the local client collects re- 
mote configuration information. 

The impact of remote configuration on the health of
a networked application can vary. In some instances, a
problem may arise because of misconfiguration at the
remote end. For example, if the remote network blocks
access to port 21, attempts to connect to an FTP server on
that network would fail. In other instances, the remote 
configuration may not be problematic per se. Rather, 
it is the mismatch between the local configuration and 
the remote configuration that is problematic. For in- 
stance, while some users might be able to access a file 
server, others may not be able to because their creden- 
tials are not included in the access control list (ACL) on 
the server. In other words, there is a mismatch between 
the local configuration (the local user’s credentials) and 
the remote configuration (the ACL on the server). 

Once the remote configuration information has been 
obtained, it is incorporated into NetPrints’ diagnostics 
procedure in the same manner as local configuration in- 
formation. The one exception, which requires some ad- 
ditional pre-processing, is incorporating the mismatch 
between local and remote configurations, a problem we 
turn to next. 


4.4 Composing Configurations 


Since it is the combination of local and remote
configurations that matters in some cases, we introduce new,
composite configuration parameters that are derived by
combining local and remote configuration parameters.
Conceptually, a composite parameter, $C$, is a Boolean
derived by applying a comparison operator, $\otimes$, to a
local parameter, $L$, and a remote parameter, $R$. That is,

$C = L \otimes R$.

The specific comparison operators we focus on are
equality “$=$” and set membership “$\in$”. For example, if
the local Windows workgroup $L_1$ and the remote
Windows workgroup $R_1$ are the same, then $C_1 = 1$. Else,
$C_1$ is set to 0. Another example is of checking whether
the local username $L_2$ is part of the remote ACL $R_2$ for
a file sharing application. If it is (i.e., $L_2 \in R_2$), the
corresponding composite parameter $C_2$ is set to 1.
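
The two comparison operators can be illustrated with a short sketch. The function names, dictionary keys, and sample values below are assumptions made for illustration; they are not NetPrints identifiers.

```python
def compose_equal(local_value, remote_value) -> int:
    """Equality composite: 1 if the local and remote values match, else 0."""
    return int(local_value == remote_value)


def compose_member(local_value, remote_collection) -> int:
    """Membership composite: 1 if the local value is in the remote set, else 0."""
    return int(local_value in remote_collection)


# Hypothetical local and remote snapshots.
local = {"workgroup": "HOME", "username": "alice"}
remote = {"workgroup": "HOME", "acl": {"alice", "bob"}}

composites = {
    # C1: are the local and remote Windows workgroups the same?
    "C1": compose_equal(local["workgroup"], remote["workgroup"]),
    # C2: is the local username in the remote ACL?
    "C2": compose_member(local["username"], remote["acl"]),
}
print(composites)
```

Each composite is a Boolean encoded as 0 or 1, so it can be treated like any other categorical parameter.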


4.5 Reducing Composite Parameters 


Blindly comparing all pairs of local and remote config- 
uration parameters results in an explosion in the num- 
ber of composite parameters, most of which would be 
meaningless (e.g., a comparison of the local user name 
with the DHCP setting on the remote router). To limit 
the number of such composite parameters, without
requiring an understanding of the semantics of the
parameters, NetPrints (1) only uploads composites that
explicitly match, and (2) excludes from the learning
process parameters that only ever take one value.

In our experimental setup, the configuration scraper 
captures roughly 500 configuration parameters from the 
router and 2100 from the end-host, at each of the local 
and remote ends. This yields an additional 1500
composite parameters, after reduction is applied, and hence
a total of (2100 + 500) × 2 + 1500 = 6700 parameters.
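
The two reduction heuristics of Section 4.5 might be sketched as follows. This is one reading of the text, not the NetPrints implementation: heuristic (1) is interpreted as emitting an equality composite only for local/remote pairs whose values match, and all function and key names are hypothetical.

```python
def matching_composites(local: dict, remote: dict) -> dict:
    """Heuristic (1), as read here: keep only pairs whose values match."""
    return {
        f"{lk}=={rk}": 1
        for lk, lv in local.items()
        for rk, rv in remote.items()
        if lv == rv
    }


def drop_constant_parameters(samples: list[dict]) -> list[dict]:
    """Heuristic (2): drop parameters that take one value across all samples."""
    keys = set().union(*(s.keys() for s in samples))
    # A parameter missing from a sample is treated as the value "NA".
    varying = {k for k in keys if len({s.get(k, "NA") for s in samples}) > 1}
    return [{k: v for k, v in s.items() if k in varying} for s in samples]


samples = [
    {"mtu": "1500", "upnp": "1"},
    {"mtu": "1400", "upnp": "1"},
]
print(drop_constant_parameters(samples))
```

A parameter with a single value cannot separate “good” from “bad” samples, so removing it does not change what a learner can infer.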


5 Configuration Trees 


Based on the labeled configuration information ob- 
tained from clients, we construct per-application deci- 
sion trees, called configuration trees, which encode Net- 
Prints’ learning of which parameter settings work and 
which do not. We start with a brief introduction to de- 
cision trees and then turn to how NetPrints constructs 
configuration trees and uses these for diagnosis. 


5.1 Decision Trees 


Figure 2: Configuration tree for the VPN client application
discussed in Section 9.2. [Tree diagram: the root node is
local.disable_spi, with branches 0, 1, and NA; other decision
nodes include local.filter, speed (1Gbps or 100Mbps), and
local.dmz_enable; leaves are labeled Good or Bad and carry
sample counts such as (50/1) and (48/0).]








NetPrints uses decision trees as a basis for performing
configuration mutation. A decision tree (see Figure 2
for an example) is a predictive model that maps
observations (e.g., a client’s network configuration) to their
target values or labels (e.g., “good” or “bad”). Each
non-leaf node in the decision tree corresponds to an attribute
of the observation, and the edges out of the node
indicate the values that this attribute can take. Thus, each
leaf node corresponds to an entire observation and
carries a label. Given a new observation, we start at the root
of the decision tree and walk down the tree, taking the
branches corresponding to the individual attributes of the
observation, until we reach a leaf node. The label on the
leaf node identifies the configuration as “good” or “bad”.
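
The root-to-leaf walk described above can be sketched in a few lines. The tree below is a made-up toy, not the tree of Figure 2; the Node and Leaf classes and the classify function are hypothetical names used only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Leaf:
    label: str  # "good" or "bad"


@dataclass
class Node:
    attribute: str
    children: dict = field(default_factory=dict)  # attribute value -> subtree


def classify(tree, observation: dict) -> str:
    """Walk from the root to a leaf, following the observation's values."""
    node = tree
    while isinstance(node, Node):
        # An attribute absent from the observation follows the "NA" branch.
        value = observation.get(node.attribute, "NA")
        # Raises KeyError for a value never seen during tree construction.
        node = node.children[value]
    return node.label


# Toy tree with invented labels, for illustration only.
toy_tree = Node("local.disable_spi", {
    "0": Leaf("good"),
    "1": Node("local.dmz_enable", {"0": Leaf("bad"), "1": Leaf("good")}),
    "NA": Leaf("good"),
})

print(classify(toy_tree, {"local.disable_spi": "1", "local.dmz_enable": "0"}))
```

Because each step inspects a single attribute, the path taken doubles as an explanation of why the observation received its label.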

There are several algorithms for decision tree
learning. We chose a widely-used algorithm, C4.5 [14],
which builds trees using the concept of information gain.
The C4.5 tool starts with the root, and at each level of
the tree chooses the attribute to split the data that
reduces the entropy by the maximum amount. The result
is that the branch points (i.e., non-leaf nodes with
multiple children) at the higher levels of the tree correspond to
attributes with greater predictive power, i.e., those with
distinct values or ranges corresponding to distinct labels.
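
To make the entropy-reduction criterion concrete, the sketch below computes plain information gain for a categorical attribute. It is an illustration under simplifying assumptions: C4.5 itself refines this measure with a normalized gain ratio and also handles numeric attributes, both of which are omitted here, and the sample data is invented.

```python
import math
from collections import Counter


def entropy(labels: list) -> float:
    """Shannon entropy (in bits) of a list of labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def information_gain(samples: list[dict], labels: list, attribute: str) -> float:
    """Entropy reduction obtained by splitting the samples on one attribute."""
    groups: dict = {}
    for sample, label in zip(samples, labels):
        groups.setdefault(sample.get(attribute, "NA"), []).append(label)
    # Weighted average entropy of the groups produced by the split.
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


# Invented data: "spi" separates the labels perfectly, "mtu" not at all.
samples = [
    {"spi": "0", "mtu": "1500"},
    {"spi": "0", "mtu": "1400"},
    {"spi": "1", "mtu": "1500"},
    {"spi": "1", "mtu": "1400"},
]
labels = ["good", "good", "bad", "bad"]

best = max(["spi", "mtu"], key=lambda a: information_gain(samples, labels, a))
print(best)
```

The attribute with the highest gain is placed nearest the root, which is why the most predictive settings appear at the top of a configuration tree.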

When the training data is noisy (e.g., it contains
mislabeled samples) or there are too few samples, there is
the risk that the above algorithm will over-fit the
training data. To address this concern, C4.5 also includes a
pruning step, wherein some branches in the tree are
discarded so long as this does not result in a significant error
with respect to the training data (a process called
generalization). C4.5 uses a confidence threshold to determine
when to stop pruning. In our implementation, we use the
default threshold. A consequence of pruning is that, if
the number of samples is insufficient, these samples will
not be reflected in the decision tree.

A decision tree has two key properties. First, it en- 
ables classification of observations that include both 
quantitative and categorical attributes. For example, the 
decision tree in Figure 7 includes quantitative attributes 
such as the WAN MTU and categorical attributes such as 
the security mode. Second, a decision tree is amenable 
to easy interpretation. It not only enables classification 
of observations, it also helps identify in what minimal 
way an observation could be mutated so as to change its 
label (e.g., from “bad” to “good”). We elaborate on this 
property in Section 5.4. The interpretability of decision 
trees, in particular, makes it an attractive alternative to 
SVMs or Bayesian classification. 


5.2 Labeling Configuration Information 


As explained in Section 4, the NetPrints client extracts 
configuration information from the local host and net- 
work as well as from the remote end. Before this in- 
formation can be fed to the NetPrints server, it has to be 
labeled as either “good” or “bad”, depending on whether 
the application in question was working or not. In gen- 
eral, it is hard to determine automatically whether an 
arbitrary application is working well. We sidestep this 
difficulty by enlisting the help of the human user to la- 
bel the application runs. If we assume that the majority 
of users are honest, then most of the configuration in- 
formation submitted to the NetPrints server will be la- 
beled correctly. As we discuss in Section 9.6, decision 
tree based learning employed by the server is robust to 
mislabeling to a large extent. Also, in Section 10.1, we 
discuss ways of reducing the burden of labeling on users. 


5.3 Configuration Manager


The configuration manager at the NetPrints server
uses the labeled configuration information submitted by
clients to learn and construct per-application configuration
trees, using C4.5. The tree comprises decision
nodes, which are branch points, and leaf nodes, which 
correspond to “good” or “bad” labels. A path from the 
root to a “good” (“bad”) leaf node indicates the parameter
settings for a working (non-working) configuration.

Figure 2 shows an example of such a configura- 
tion tree that we generated for the Microsoft Con- 
nection Manager VPN application [13] using con- 
figuration information from clients using several dif- 
ferent router devices (see Table 5). We note that 
the local.disable_spi attribute (corresponding to
whether stateful packet inspection (SPI) is disabled) is 
the clearest, even if not a perfect, indicator of whether a 
configuration is good or bad. So it is at the root of the 
configuration tree. 

Note that a decision node in the configuration tree 
may have a branch labeled NA (not applicable), in addition
to branches corresponding to the various parameter
settings (e.g., 0 and 1 with local.disable_spi).
The NA branch is needed since some parameters may be 
absent in particular routers. 
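
As a rough illustration of the structure described above, the following sketch models a configuration tree whose decision nodes carry an NA branch, and walks it from the root to a leaf. The class names, the `classify` helper, and the small example tree are hypothetical and only loosely modeled on the VPN example; they are not taken from the NetPrints implementation.

```python
from dataclasses import dataclass, field
from typing import Dict, Union

NA = "NA"  # branch followed when a router does not expose the parameter


@dataclass
class Leaf:
    label: str  # "good" or "bad"


@dataclass
class Decision:
    attribute: str
    branches: Dict[str, Union["Decision", Leaf]] = field(default_factory=dict)


def classify(node, config):
    """Walk from the root to a leaf, taking the NA branch for absent parameters."""
    while isinstance(node, Decision):
        value = str(config.get(node.attribute, NA))
        node = node.branches[value]
    return node.label


# Hypothetical, simplified tree: the real tree in Figure 2 has more levels.
tree = Decision("local.disable_spi", {
    "0": Leaf("good"),
    "1": Leaf("bad"),
    NA: Decision("pptp_pass", {
        "1": Leaf("good"),
        "0": Leaf("bad"),
        NA: Leaf("good"),
    }),
})
```

A configuration that lacks local.disable_spi altogether falls through the NA branch and is classified on pptp_pass instead.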

Currently, the decision tree algorithm we use does not 
allow for incremental training of the trees, hence we use 
a cache of configurations to perform the training at each 
step. However, incremental update based algorithms ex- 
ist [17] and we plan to evaluate these in future work. 


5.4 Misconfiguration Diagnosis 


When users experience application failure, they initiate 
the diagnosis procedure on the NetPrints client. The 
NetPrints client scrapes and submits its suspect configu- 
ration information to the NetPrints server for diagnosis. 
At the server end, the configuration manager starts at the 
root and walks down the configuration tree correspond- 
ing to the application that the user is complaining about. 
If it ends at a “bad” node, it means that the client’s con- 
figuration is known to be non-working. On the other 
hand, if it ends at a “good” node, it means that the con- 
figuration tree is unable to help with the diagnosis, a case 
we consider in Section 7. 

If the client’s configuration corresponds to a known 
“bad” state, then the goal of diagnosis is to identify the 
configuration mutations that would move the configura- 
tion to a known “good” state. In general, there would 
be multiple “good” leaf nodes, so which one should we 
mutate towards? 

Intuitively, we would like to pick the mutation path 
that is easiest to traverse. The easiest path is not neces- 
sarily the one with the fewest changes. The difficulty of 
making the changes also matters. For example, chang- 
ing the router hardware (say switching from a Linksys 
router to a Netgear router) would likely be more difficult
than modifying a software-settable parameter on
the router because of the costs involved. Even among
software-settable parameters, some changes might be
less desirable, and hence more difficult to make, than
others. For example, putting the client host on the DMZ,
and thereby exposing it to external traffic, would likely
be less desirable than, say, enabling port forwarding for a
specific port.

Figure 3: Illustration of the costs of different configuration
mutations.

To determine the degree of difficulty automatically, 
NetPrints records the frequency with which various con- 
figuration parameters are modified across all clients. It 
might find, for instance, that the disable_spi param- 
eter is modified 100 times as often as the device is. 
We quantify the cost of a mutation as the reciprocal of 
the change frequency, possibly scaled by a constant fac- 
tor, of the corresponding configuration parameter. We 
might record some spurious changes, say when a mo- 
bile client moves from one network to another and mis- 
takenly thinks that its router device and various con- 
figuration settings have “changed”. However, we can 
counter the effect of mobility by hard-coding the fact 
that changing routers is a low-frequency, and therefore 
high-cost, change. Thereafter, when a client is mobile 
and associates with a new router, we infer that the corre- 
sponding changes in configuration detected by NetPrints 
are because the router changed, not because the user ex- 
plicitly changed configurations. Hence we do not in- 
crease the change frequency of the parameters. 
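
The bookkeeping described in this paragraph can be sketched as follows. This is a minimal illustration under stated assumptions: the names `change_counts`, `PINNED_COSTS`, `record_changes`, and `mutation_cost` are hypothetical, the pinned value of 100 is arbitrary, and treating a never-observed change as infinitely costly is our own simplification rather than something the text specifies.

```python
import math
from collections import Counter

change_counts = Counter()          # parameter -> user-initiated changes observed
PINNED_COSTS = {"device": 100.0}   # assumed constant: router swaps are rare, so costly


def record_changes(changed_params, router_changed):
    """Count parameter changes, ignoring those explained by a new router."""
    if router_changed:
        return  # attribute the differences to mobility, not to the user
    change_counts.update(changed_params)


def mutation_cost(param, scale=1.0):
    """Cost is the reciprocal of the change frequency, scaled by a constant."""
    if param in PINNED_COSTS:
        return PINNED_COSTS[param]
    count = change_counts[param]
    # Assumption: a parameter never seen to change is treated as infinitely costly.
    return scale / count if count else math.inf
```

Under this sketch, a parameter that users change often ends up with a low cost, while the hard-coded router entry stays expensive no matter how many spurious "changes" mobile clients report.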

Figure 3 illustrates how the configuration tree is an- 
notated with costs. The cost of changing the router 
device is 100 times greater than the cost of changing 
the disable_spi setting. Some mutations are impossible
to effect, so the corresponding cost is set to ∞.
For instance, it is not possible to set disable_spi to 
NA when the parameter does not exist on the router in 
question. Also, note that the cost is incurred only when 
a parameter is changed, hence the zero cost for merely 
walking up the tree. 

Given the mutation costs indicated above, we com- 
pute the cost of moving from a “bad” leaf node to a 
“good” leaf node as the sum of the costs of the muta- 
tions on the path from the former to the latter. NetPrints 
recommends the set of mutations corresponding to the 
path with the lowest cost. 
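
The selection of the lowest-cost path can be sketched as below. As a simplification, each “good” leaf is flattened into the set of parameter settings on its root-to-leaf path, rather than walking the tree itself. The function names and the example cost function are hypothetical, and the cost values are placeholders.

```python
import math


def path_cost(config, good_path, cost_of):
    """Sum the cost of every parameter that must change to satisfy good_path."""
    total = 0.0
    for param, wanted in good_path.items():
        current = config.get(param, "NA")
        if current != wanted:
            total += cost_of(param, current, wanted)
    return total


def recommend(config, good_paths, cost_of):
    """Return (cost, mutations) for the cheapest reachable good leaf, or None."""
    best = None
    for path in good_paths:
        cost = path_cost(config, path, cost_of)
        if math.isinf(cost):
            continue  # at least one mutation on this path is impossible
        mutations = {p: v for p, v in path.items() if config.get(p, "NA") != v}
        if best is None or cost < best[0]:
            best = (cost, mutations)
    return best


def example_cost(param, current, wanted):
    """Placeholder costs; moving to or from NA is treated as impossible."""
    if "NA" in (current, wanted):
        return math.inf
    return {"dmz": 5.0, "disable_spi": 1.0}.get(param, 1.0)
```

For a client with disable_spi=1 and dmz=0, a good leaf that requires only disable_spi=0 is preferred over one that requires dmz=1, since the former has the lower summed cost.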


5.5 Going Beyond Configuration Trees


The per-application configuration trees help diagnose 
misconfigurations based on configuration information 
on which there is broad agreement across a large number 
of participating NetPrints clients. Basically, the config- 
uration manager learns about the goodness or otherwise 
of various configuration settings based on static snap- 
shots of labeled configuration information uploaded by 
clients. 

However, as noted in Section 3, diagnosis based on 
the configuration tree would not work in the case of mis- 
configurations that are exceptions to the norm. Such ex- 
ceptions could arise, for instance, from hidden configu- 
ration settings (as noted in Section 3) or from decision 
tree pruning (as explained in Section 5.1). In such cases, 
the configuration tree might suggest that the suspect con- 
figuration is “good” and hence not be in a position to 
suggest any mutations. 

To address this issue, we introduce change trees, 
which seek to learn based on dynamic information, 
i.e., configuration changes. Furthermore, to reduce the
chances of exceptions being buried by the mass, we use 
network traffic signatures to index the change trees. 

Note, however, that multiple configuration errors 
could yield the same network signature, so a network 
signature is, in general, not as informative as the config- 
uration information itself. Hence our approach is to use 
the configuration tree as the primary option, with the change trees
indexed using network signatures as the fallback option. 

We now discuss how NetPrints constructs network 
traffic signatures, and then turn to change trees. 


6 Network Traffic Signature 


We use a network traffic signature to characterize appli- 
cation runs. For instance, an application could fail be- 
cause it is unable to establish a TCP connection (SYN 
handshake failure) or because the TCP connection is re- 
set prematurely. The network traffic signature is used 
to distinguish between these failure modes. In essence, 
the signature records the symptom of the failure, which 
is used to index the change trees of the application, as 
explained in Section 7. 

The basic approach is for the NetPrints clients to ex- 
tract a set of network traffic features from a packet trace 
of the application run. The NetPrints server then applies 
learning on these features to identify the important ones, 
which are then included as part of the network traffic 
signature for that application. 


6.1 Network Traffic Feature Extractor 


The network traffic feature extractor characterizes the
network usage of each application running on the client
machine. In our current implementation, it uses the WinPcap
library and the IP Helper API on Windows to tie all
observed network traffic to the individual processes, and
hence applications, running on the client machine. For
each running application, it extracts a set of features by
examining its network activity. These features form the
feature vector for the application.


# | Feature description                     | Unit
1 | TCP: Three SYN no response              | 5-tuple
2 | TCP: RST after SYN, no data exchanged   | 5-tuple
3 | TCP: RST after no activity for 2 min    | 5-tuple
4 | TCP: RST after some data exchanged      | 5-tuple
5 | UDP: Data sent but not received         | 5-tuple
6 | Other: Data sent but not received       | src-dst IP addr pair
7 | All: No data sent or received           | all traffic

Table 2: Network traffic features and the unit of communication
over which the feature is extracted. Each feature
is maintained separately for inbound and outbound directions,
except for “All”, which is maintained for both
directions together.

Table 2 lists the set of features we extract in the form 
of rules. Most of these features are maintained sepa- 
rately for the inbound (I) and outbound (O) directions, 
depending on whether the communication was initiated 
by the remote host or by the local host. While many of 
these features are extracted on a per-5-tuple basis (i.e., 
on per-connection basis for TCP), we combine the fea- 
tures across all connections of an application to compute 
the bits of the feature vector. Specifically, if at least one
connection of an application satisfies any of these rules, 
the corresponding bit in the feature vector is set. Note 
that it is possible for multiple bits in an application’s fea- 
ture vector to be set. Also, while all of the features we 
consider at present are binary, the feature set could be 
expanded to include non-binary features. 
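
The way per-connection rule matches are combined into a per-application feature vector can be sketched as follows. The feature identifiers and the `feature_vector` helper are hypothetical names chosen for illustration; the layout (six directional features plus one combined feature, 13 bits in all) follows Table 2.

```python
FEATURES = [
    "tcp_three_syn_no_response",
    "tcp_rst_after_syn_no_data",
    "tcp_rst_after_idle",
    "tcp_rst_after_some_data",
    "udp_sent_not_received",
    "other_sent_not_received",
]
DIRECTIONS = ["inbound", "outbound"]

# Six features tracked per direction, plus "All", tracked for both together.
BITS = [(d, f) for f in FEATURES for d in DIRECTIONS] + [("both", "all_no_data")]


def feature_vector(connections):
    """Set a bit if at least one connection of the application matched the rule.

    `connections` is a list with one set per connection, each holding the
    (direction, feature) rules that connection satisfied.
    """
    matched = set().union(*connections)
    return [1 if bit in matched else 0 for bit in BITS]
```

Because the bits are combined with a logical OR across connections, several bits may be set at once, and a single failing connection is enough to set its bit for the whole application.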

We identified the set of features in Table 2 based 
on empirical observations of the ways in which an ap- 
plication’s network communication may typically fail. 
The first four features in the table capture various kinds 
of TCP-level issues that we commonly see in malfunc- 
tioning applications. Several applications and services 
such as multimedia streaming, DNS and VPN clients use 
transport protocols other than TCP. For all of these, the 
lack of connectivity in one direction often indicates a 
networking problem. Consequently, we have included 
features #5 and #6 to capture the behavior of such appli- 
cations. For both features, we use a timeout of 2 min- 
utes: if no data is received for a period of 2 minutes, 
we interpret this as a possible problem and set the feature.
Feature #7 characterizes a total loss of connectivity
for an application using any transport protocol; problem
#18 in Table 1, for instance, is a scenario in which our
system would use this feature.


Finally, we briefly discuss two issues pertaining to 
the recording of network features for an application run. 
First, since the instance of an application could run for 
an extended period of time (e.g., a Web browser could 
run for days or weeks), we only consider network traf- 
fic features over a short window of time (typically a few 
minutes long) extending into the recent past. Second, 
extracting the network traffic feature for an application 
run requires capturing its traffic. One possibility is to 
run traffic capture continuously, which has the advantage 
that a record of the traffic will be available even when an 
application run failed. 


To reduce the overhead of the NetPrints client with
such continuous traffic capture, we split the network
signature generator into two parts: a lightweight, con- 
tinuously running component to capture selected packet 
headers and connection-to-process bindings, and a rel- 
atively more CPU-intensive component that creates the 
feature vector from the trace only when needed. Mea- 
surements of our implementation show that the over- 
head is low (0.8% CPU load) on a 1.8 GHz laptop 
PC running Windows Vista Enterprise, while streaming 
video over the Internet and simultaneously synchroniz- 
ing email folders with the server. 


6.2 Network Signature Generator 


The NetPrints client records and uploads the feature vec- 
tor for an application run to the NetPrints server, either 
when the user invokes NetPrints to complain about a 
non-working application or when the user is prompted, 
as explained in Section 5.2. In either case, the feature 
vector is labeled as “good” or “bad”, just as the ac- 
companying configuration information is. The NetPrints 
server then applies learning on the mass of labeled fea- 
ture vectors for an application to identify the most signif- 
icant features, i.e., ones that correspond most strongly to 
the fate of an application run. These significant features 
define the network signature of the application. 
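
To give a feel for what "most significant features" means here, the sketch below ranks binary features by information gain over labeled feature vectors. This is a deliberate simplification: C4.5 itself selects splits using gain ratio (a normalized form of information gain) and also prunes the resulting tree, neither of which is shown. The function names are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy, in bits, of a list of "good"/"bad" labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(vectors, labels, bit):
    """Entropy reduction obtained by splitting the labeled runs on one feature bit."""
    gain = entropy(labels)
    for value in (0, 1):
        subset = [lab for vec, lab in zip(vectors, labels) if vec[bit] == value]
        if subset:
            gain -= len(subset) / len(labels) * entropy(subset)
    return gain


def rank_features(vectors, labels):
    """Feature indices ordered from most to least informative."""
    bits = range(len(vectors[0]))
    return sorted(bits, key=lambda b: information_gain(vectors, labels, b), reverse=True)
```

A feature whose value separates “good” runs from “bad” runs perfectly has the maximum gain and would sit near the root of the signature tree, while a feature unrelated to the outcome has a gain near zero and would not appear in the signature.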


The signature generator, again, uses the C4.5 algo- 
rithm to learn the network signatures, which are repre- 
sented as per-application signature trees. However, un- 
like with learning applied to configuration information, 
interpretability is not necessary for signature construc- 
tion (since there are no mutations to perform), so we 
could have also used a different learning algorithm such 
as SVM. Figure 5 shows the signature tree generated for 
an FTP application, where 2 features, out of the 13 in all, 
are sufficient to capture the network problems seen.


7 Change Trees


As noted in Section 5.5, change trees are used as the 
fallback option when the configuration tree fails to diag- 
nose a problem. To understand why configuration tree 
based diagnosis might fail, consider problem #8 in Ta- 
ble 1. The FTP server in question enables passive mode 
by default, so that all connections are initiated at the 
client end. However, in a small number of cases, the 
server may disable passive mode, i.e., only the server
can initiate FTP data connections. The client will disal- 
low these connections unless the client-side firewall has 
been configured to let them in. Note that the application- 
specific configuration parameter that captures the infor- 
mation that the server has disabled passive FTP is “hid- 
den” from NetPrints since, in general, NetPrints is not 
in a position to scrape such parameters. Nevertheless, 
there are non-hidden configuration parameters (the fire- 
wall parameters on the client, in this instance) that could 
be manipulated to fix the problem. 

Since the discriminating parameter is hidden, it is hard 
to tell apart the majority of clients that are configured for 
passive mode from the minority that are configured for 
active mode. So the majority prevails and the configura- 
tion tree learns to ignore the firewall settings since these 
are not of relevance for the majority of clients (i.e., FTP
works for such clients regardless of the firewall settings).
So when an active FTP connection to a client fails, the
configuration tree would not find anything amiss with
its configuration, i.e., it will find the configuration to be
“good” and leave no scope for remedial action. 

Change trees try to address this problem by isolating 
the cases where a traversal of the configuration tree ends 
up in leaf nodes labeled as “good” and then applying 
learning separately on these. For the purposes of this 
learning, the suspect configurations (which the configuration
tree thinks of as “good”) are labeled as “bad”.
Since we also need configurations labeled as “good” to
perform learning, the NetPrints client in such cases looks 
for any out-of-band configuration changes that are made 
and, when such a change is detected, it prompts the user 
to determine whether the application problem has now 
been resolved. If and when the user indicates that the 
problem has been resolved, it uploads a “good” configu- 
ration to the NetPrints server. 

The NetPrints server uses the C4.5 algorithm to learn 
a decision tree — the change tree — based on the 
change information: the “before” configurations la- 
beled as “bad” and the “after” configurations labeled as 
“good”. To isolate the relevant cases and minimize the 
mixing of unrelated problems, we use the network sig- 
nature corresponding to application failure to index the 
change trees. So, in effect, each “bad” leaf node in the 
signature tree can point to a separate change tree. 

Each change tree is also traversed the same way as
the main configuration tree. If a traversal of the relevant
change tree also ends in a leaf node labeled as “good”,
NetPrints gives up. It could be that NetPrints does not
have sufficient information to identify the misconfiguration
or that the problem is not configuration-related.


8 Summary of NetPrints Operation 


In summary, NetPrints performs the following steps in 
the construction and diagnosis phases. 


Construction Steps: 

1) The NetPrints clients upload labeled configuration 
information and network feature vectors to the NetPrints 
server, either when users invoke NetPrints for diagnosis 
or are prompted by NetPrints (the latter happens for a 
small fraction of application runs). 

2) The NetPrints server feeds the labeled configura- 
tion information into the C4.5 decision tree algorithm to 
construct an application-specific configuration tree. It 
feeds the labeled network feature vector to the same al- 
gorithm to learn an application-specific signature tree. 

3) During the diagnosis phase (see below), if the 
traversal of the configuration tree with a suspect con- 
figuration terminates in a “good” leaf node, then this 
configuration, now labeled as “bad”, is fed into the 
application-specific change tree construction procedure. 

4) Furthermore, the NetPrints client prompts the user 
to determine if future configuration changes, if any, help 
restore the application to a working state. If so, the corresponding
configuration, labeled as “good”, is fed into
change tree construction at the NetPrints server. 


Diagnosis Steps: 

1) When the user encounters a problem and invokes 
diagnosis, the NetPrints client uploads configuration in- 
formation, along with the network feature vector for the 
affected application, to the NetPrints server. 

2) The NetPrints server traverses the configuration 
tree with the suspect configuration submitted by the 
client. If this traversal ends in a “bad” leaf node, Net- 
Prints identifies the set of configuration mutations, with 
the lowest cost, that would help move the configuration 
to a “good” state. 

3) If the traversal of the configuration tree ends in a 
“good” leaf node, the NetPrints server first computes the 
signature of the failed application run based on the net- 
work feature vector submitted by the NetPrints client. 

4) The NetPrints server uses the signature to iden- 
tify the relevant change tree and then traverses this tree 
with the suspect configuration. If this traversal ends in a 
“bad” leaf node, then the NetPrints server uses the same 
procedure as indicated above to identify mutations. 

5) However, if the traversal of the change tree ends in 
a “good” leaf node, the NetPrints server gives up. 
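
The diagnosis steps above can be summarized in a short control-flow sketch. Trees are modeled here simply as callables that map a configuration to “good” or “bad”, and `signature_of` and `find_mutations` are hypothetical stand-ins for signature computation and lowest-cost mutation search. None of these names come from the NetPrints code.

```python
def diagnose(config, features, config_tree, signature_of, change_trees, find_mutations):
    """Return recommended mutations, or None when the server gives up."""
    # Step 2: a "bad" leaf in the configuration tree yields mutations directly.
    if config_tree(config) == "bad":
        return find_mutations(config_tree, config)

    # Step 3: the configuration looks "good", so compute the failure signature.
    signature = signature_of(features)

    # Step 4: use the signature to select a change tree and traverse it.
    change_tree = change_trees.get(signature)
    if change_tree is not None and change_tree(config) == "bad":
        return find_mutations(change_tree, config)

    # Step 5: no matching change tree, or it also ends in "good": give up.
    return None
```

The ordering reflects the design choice stated in Section 5.5: the configuration tree is consulted first because it is more informative, and the signature-indexed change trees serve only as the fallback.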
Config scraper | Feature extractor | Config manager | Signature generator
P39            | 701               | 1767           | 460

Table 3: Lines of code for NetPrints prototype.


9 Experimental Evaluation 


Our experimental evaluation of NetPrints is based on the 
prototype we have implemented on Windows Vista SP1, 
using a combination of C# and C++. Table 3 summarizes
some information on the implementation; for C4.5, we 
used a standalone distribution [14]. 

We deployed the NetPrints client on 4 hosts behind 
separate broadband connections. Given this small scale 
of our current deployment, we used hosts on a separate 
testbed to scale up the effective size of the deployment, 
as we elaborate on below. The data gathered from the 
testbed was used in the “construction” phase of Net- 
Prints during which the NetPrints server, which ran on 
a separate host, learnt the configuration, signature, and 
change trees. The “diagnosis” phase was initiated from 
one of the 4 broadband hosts and involved communica- 
tion with the NetPrints server to perform diagnosis. 


9.1 Setup and Methodology 


We evaluated NetPrints with 4 applications: Microsoft’s 
VPN client, a Perl-based FTP client, Windows Vista file 
sharing, and Xbox Live. These applications were run 
both on our testbed (construction phase) and a separate 
set of broadband hosts (diagnosis phase). Our testbed in- 
cluded a Windows Vista laptop (two in the case of the file 
sharing application), each running the NetPrints client, 
and also an Xbox 360 gaming console, all of which were 
uplinked via a home router and a DSL broadband mo- 
dem. We also had 4 other hosts, including 2 at peo- 
ple’s homes, on separate broadband connections, each 
running the NetPrints client from which diagnosis was 
initiated. Finally, for the FTP application, we also had 
an external machine running the client, not on a broad- 
band network, that connected to one of the broadband 
hosts via the Internet. 

For diversity, we used 7 different routers from Net- 
gear, Linksys, D-Link, and Belkin (Table 5), in turn, as 
the home router in our testbed. To obtain greater di- 
versity, as one might see with a large-scale deployment, 
we varied the configuration settings on these routers, re- 
running the applications each time. Note that although 
we varied these configuration settings artificially, we ran 
the applications and NetPrints just as they would be run 
in the real world. 

We identified 11 parameters (Table 4) and learnt variations
in their settings based on a study of online discussion
forums. Even with this subset of parameters, many
configurations are possible (e.g., 4800 with the D-Link
DIR-635 router). So for each application, we only
experimented with a subset of these variations.


Router parameters:

  MTU {1100, 1200, 1300, 1400, 1500 bytes}: supported
    by all routers except Belkin F5D7230.
  VPN-specific parameters {on, off}: the D-Link
    router supports pass-through for IPSEC and PPTP,
    while the Linksys routers support these and also L2TP
    pass-through.
  Stateful Packet Inspection (SPI) {on, off}: supported
    by all routers except Linksys WRT54G and Belkin
    F5D7230.
  Wireless security parameters {none, WEP, WPA,
    WPA2}: all modes supported by all routers, except
    that the Netgear WGR614v5 does not support WPA2.
  DMZ {on, off}: supported by all routers.
  UPnP {on, off}: supported by all routers.
  NAT type {symmetric, full cone, restricted cone}:
    only supported by Netgear WGR614v7 and D-Link
    DIR-635.
  Port forwarding for FTP {on, off}: supported by all
    routers but only used for our FTP experiment.

End-host parameters:

  Domain or Workgroup joined
  Current user {Administrator, Guest, Everyone,
    other}
  Windows Vista firewall rules {on, off}

Table 4: Parameters varied in our experiments


To automate the data collection process, we used Au- 
toHotKey [1], a GUI scripting tool. To change con- 
figuration settings on the router, we used customized 
HTTP POST messages. To configure end-hosts, we 
manually changed the relevant parameters. For every 
configuration setting, we ran the applications and used 
simple application-specific heuristics to automatically 
determine whether the application worked (labeled as 
“good”) or not (“bad”). These heuristics varied based on
the application. For example, when the VPN client suc- 
cessfully connects, opening the VPN application’s win- 
dow displays the status of the connection. If the VPN 
connection was unsuccessful, then the same window 
shows the user an option to re-initiate the connection. 
Using AutoHotKey, we captured exactly which kind of 
message followed our attempt to set up the VPN con- 
nection, thereby determining if the application worked 
or not. 

We recreated all of the problems related to VPN 
clients, file sharing, FTP, and the Xbox shown in Table 1, 
except for #2 and #6. In addition, our testbed itself presented
new problems.

The diversity of configurations that we artificially induce
in our testbed facilitates the construction of the
application-specific configuration, signature, and change 
trees. However, it is hard to know how much diversity 
there would be in practice, in the absence of a large-scale 
deployment. Nevertheless, in Section 9.6, we demon- 
strate NetPrints’ robustness to noisy data. 

Finally, there is no standardized nomenclature for 
router configuration parameters. The parameter names 
vary across routers even when the functionality involved 
is the same. We avoid any manual steps to establish 
correspondence across routers or segregate information 
based on router model. If two router models happen to 
use the same parameter name, NetPrints will recognize 
and incorporate this in its learning process. Otherwise, 
it will treat the parameters as separate and unrelated. As 
standards such as HNAP [2] become prevalent, duplica- 
tion would be reduced, resulting in more compact and 
better interpretable configuration trees. 


9.2 Microsoft Connection Manager 


The Microsoft Connection Manager (CM) [13] is a 
PPTP-based VPN client. For our evaluation, we used the 
7 different routers in turn, varying the settings on each 
and then using CM to try connecting to an external VPN 
server. Table 5 shows the number of “good” and “bad” 
cases recorded with each router through this process. 

Figure 2 shows the configuration tree for CM generated
by the NetPrints server. Of all the configuration
parameters, the algorithm picked disable_spi,
pptp_pass, filter, ethernet.speed,
ipsec_pass and l2tp_pass as the discerning
ones. The numbers at every leaf node are of the form
(x/y), where x is the total number of data points that the
path from root to that leaf captures, and y is the number
of misclassifications on that path.

We can explain the structure of the tree as follows.
Only the Netgear routers support the specific
disable_spi parameter. For these routers, CM
works if disable_spi is not set and does not work
if disable_spi is set, irrespective of the other parameter
settings. On one of the runs involving the
Netgear WGR614v5 router, CM failed even though
disable_spi was not set, explaining the one misclassification
on this path.

If disable_spi is not applicable, as for the
Linksys, D-Link and Belkin routers, the next parameter
that the tree learns is pptp_pass, which is available
only on the Linksys routers. When pptp_pass=1, CM
works with all three Linksys routers. If pptp_pass=0,
there are further conditions, depending on the specific
Linksys router. Finally, pptp_pass=NA for the D-Link
and Belkin routers, through which CM works regardless
of the settings. The alg_pptp parameter on
the D-Link DIR-635, which is supposed to control PPTP
pass-through, is apparently a no-op.


NSDI ’09: 6th USENIX Symposium on Networked Systems Design and Implementation

Next, the tree looks at filter, the stateful packet inspection
parameter on the Linksys WRT310N and DD-WRT
routers. The WRT54G does not support this option,
so all configurations with filter=NA, i.e., all
WRT54G configurations with pptp_pass=0, are bad.

The next parameter in the tree, on the filter=off
branch, is ethernet.speed, an interface-specific parameter
on the end-host. This is a little counter-intuitive
but explainable. The only gigabit ethernet router we 
used was the WRT310N. Instead of using the model 
name to distinguish between the WRT310N and the DD- 
WRT routers, the C4.5 algorithm picked the ethernet 
speeds instead, since this has the same discriminating 
power as the model name in this case. This illustrates 
that learning is data-driven rather than based on intu- 
ition. If data were available from more routers support- 
ing gigabit ethernet, we believe that C4.5 would have 
fallen back to the model name to differentiate among the 
various routers. 

On the WRT310N (ethernet.speed=1Gbps),
if filter=off, CM works irrespective of
the other parameters. On the DD-WRT
(ethernet.speed=100Mbps), CM’s success
depends on whether the client is placed on the DMZ.
In particular, if the client is not on the DMZ, then CM
works only if ipsec_pass=0 and l2tp_pass=0.
We were unaware of this restriction until NetPrints
constructed its configuration tree.

Next, we deployed the NetPrints client on 4 
broadband networks using misconfigured Linksys 
WRT54G and DD-WRT, and Netgear WGR614v5 and 
WGR614v7 routers. When CM was invoked but the 
VPN connection failed, the user pressed the “diagnose” 
button on the NetPrints client. The NetPrints server 
then used its mutation algorithm to identify remedial 
configuration changes, which were then conveyed to 
the client. For the Netgear routers, the fix was to set 
disable_spi=0, whereas for the Linksys routers, it 
was to set pptp_pass=1. The NetPrints client auto- 
matically applies these fixes to the router using an HTTP 
POST to the corresponding Web form on the router. 

This case study shows that NetPrints’ configuration 
tree has automatically captured application behaviour 
with a large number of configuration settings across 7 
routers and the client host, using a small number of 
branch points (only 7, in this case) in an intuitive rep- 
resentation. The tree also flagged configuration-related 
problems that we were unaware of previously. 


9.3 Perl-based FTP Client


Users often set up FTP servers within their home networks
so that they can have easy access to data on
their home computers from remote locations. However,
online discussion forums include several user complaints
about the FTP service not running as expected
when behind a NAT (e.g., #7 and #8 in Table 1).

Table 5: A summary of the number of configuration settings we obtained from each router for VPN, FTP, and Xbox
experiments. A “✓” lists the number of good configurations, and a “x” lists bad configurations. Cases where a
particular router was not used with an application are marked with “—”.

Figure 4: NetPrints configuration tree for the FTP client.

Figure 5: NetPrints change tree for the FTP client.

To investigate #8, in particular, we evaluated Net- 
Prints when a Perl-based FTP client running on a remote 
machine tries to connect to an IIS FTP server [3] running
on a home network behind a NAT. Besides varying
the router configuration settings, we also manually set 
and reset an application-specific parameter on the FTP 
client that determined whether the client used passive- 
or active-mode FTP. This corresponds to the hidden con- 
figuration example discussed in Section 7. 

Figure 4 shows the NetPrints configuration tree, indicating
the various server-side router settings (depending
on the router model) needed for FTP to work. Since
variable names for the same functionality vary based
on the router, the tree has learnt three different variable
names to capture the state of the DMZ (dmz_enable,
dmz_enabled, and dmz_enable_1).

Figure 6: NetPrints configuration tree for file sharing.


Note, however, that the misclassification count for 
most of the leaf nodes in the figure is significant. To 
understand why, consider the network signature and 
change trees shown in Figure 5. When the client uses 
active FTP, all of the server’s connection attempts to 
the client fail, unless a firewall rule on the client host is 
enabled for allowing incoming TCP connections to the 
FTP client (this rule is disabled by default). The network 
signature for this problem has the “Inbound:Three SYN 
no response” feature set, since the client’s firewall drops 
incoming connection attempts from the FTP server. Fig- 
ure 5 also shows the change tree corresponding to this 
signature, which essentially says that the above firewall 
rule should be enabled. 


While we used a Perl-based FTP client in this exper- 
iment for ease of automation, similar hidden configura- 
tion parameters exist in other clients. For example, IE 
7.0 has a parameter to “turn off passive FTP connec- 
tions”, which, if set, would result in similar problems 
and call for similar fixes as those discussed above. 


9.4 Windows File Sharing 


Home users often use file sharing within the home net- 
work. Online forums contain several complaints related 
to file sharing in Windows Vista, often caused by end- 
host configuration errors (e.g., #5 and #6 in Table 1). 

To investigate these, we set up an experiment where 
a client host in our home network testbed tried to ac- 
cess a folder on a server host in the same home network. 
On both the client and the server, we varied the firewall 
settings, and the domain or workgroup that the machine 
was joined to. On the server, we varied the access con- 
trol list (ACL) of users allowed to access the folder, and 
on the client, we varied the identity of the user who tried 
to access the folder. In all, we gathered data for 313 
different configurations. 

Figure 6 shows the configuration tree generated by 
NetPrints. In a nutshell, the configuration tree tells us 
that file sharing works if (a) the server-side firewall allows 
file sharing, and (b.1) either the special user “everyone” 
is a member of the folder’s ACL or the current 
user on the client is a member of the folder’s ACL, or 
(b.2) the special user “guest” is a member of the server’s 
ACL list and the current user on the client is not a local 
user on the server. 
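
The rule read off the tree can be restated as a small predicate. The sketch below is only a paraphrase of conditions (a), (b.1), and (b.2); the function and parameter names are illustrative and are not part of NetPrints, and it assumes that the ACL mentioned in (b.2) is the shared folder's ACL.

```python
def file_sharing_works(server_firewall_allows_sharing, folder_acl,
                       client_user, server_local_users):
    """Paraphrase of the file-sharing rule; all names here are illustrative."""
    # (a) the server-side firewall must allow file sharing
    if not server_firewall_allows_sharing:
        return False
    # (b.1) "everyone" or the current client user is on the folder's ACL
    if "everyone" in folder_acl or client_user in folder_acl:
        return True
    # (b.2) "guest" is on the ACL and the client user is not a local
    # user on the server
    return "guest" in folder_acl and client_user not in server_local_users
```

With only “guest” on the ACL, a user unknown to the server is admitted while a local user of the server is not, which is the counter-intuitive case discussed next.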

This last point, b.2, is interesting since it suggests that 
the special user “guest” includes all users except the lo- 
cal users on the host machine. This is counter-intuitive 
since it means that guest users can, depending on the 
policy, have greater access than local users. We con- 
firmed with experts within Microsoft that this is indeed 
expected behavior. 


9.5 Xbox Live 


Xbox Live [20] is a service that allows Xbox users to 
play multi-player games, chat, and interact over the In- 
ternet. One issue was that we could not run the NetPrints 
client directly on the Xbox since the consumer Xboxes 
are not user-programmable. For the sake of our exper- 
iments, we emulated a NetPrints client on the Xbox by 
instead running the client on a PC that is able to monitor 
all of the Xbox’s network communication. 

For this experiment, we gathered data for the Netgear 
WGR614v5 and the Linksys WRT54G routers, as indi- 
cated in Table 5. 

Figure 7 shows the configuration tree generated by 
NetPrints. NetPrints learned three configuration rules. 
First, to make the NAT open, the router needs to enable 
UPnP. Second, Xbox 360 requires the router MTU to be 
greater than 1300 to enable connectivity to Xbox Live. 
Third, the Xbox wireless adapter could not connect to a 
wireless network if the security mode used was WPA2. 
Figure 7: NetPrints configuration tree for Xbox Live. 
Figure 8: Sensitivity to mislabeled configuration data. 


NetPrints’ findings correspond to the suggested con- 
figuration fixes for #18 and #19 in Table 1, except for 
the MTU fix. We found out through support sites that 
Xbox Live requires the MTU to be set to 1365 bytes or 
larger. However, given that the data from our experi- 
ments, which formed the basis for NetPrints’ learning, 
only had the MTU set to one of five values, the best in- 
ference we could make was that the MTU should be set 
to larger than 1300 bytes. 
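
To see why sparse sampling limits the inferred threshold, consider the sketch below. It computes the range of thresholds that are consistent with a set of labeled MTU samples. The five MTU values are hypothetical (the paper does not list the values used), and the helper name is ours.

```python
def consistent_threshold_interval(samples):
    """samples: list of (mtu, works) pairs.

    Returns (lo, hi) such that every threshold t with lo <= t < hi
    satisfies "works iff mtu > t" on all samples, or None if no single
    threshold separates the data.
    """
    largest_bad = max((m for m, ok in samples if not ok), default=None)
    smallest_good = min((m for m, ok in samples if ok), default=None)
    if largest_bad is None or smallest_good is None:
        return None
    if largest_bad >= smallest_good:
        return None
    return largest_bad, smallest_good


# Hypothetical samples: five MTU settings, labeled by whether Xbox Live worked.
samples = [(1000, False), (1200, False), (1300, False),
           (1400, True), (1500, True)]
interval = consistent_threshold_interval(samples)
```

Under these assumed samples, any threshold from 1300 up to (but not including) 1400 fits the data equally well, so a learner cannot tell "greater than 1300" apart from "1365 or larger".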


9.6 Robustness Tests 


While our experiments have used clean and diverse data, 
in reality, configurations could be mislabeled and have 
limited diversity. Hence, we perform experiments to 
evaluate the robustness of the configuration trees to var- 
ious conditions not found in our experimental data. 


9.6.1 Mislabeled Configurations 


In a deployed system, configurations uploaded to the 
server will not always be labeled correctly. Mislabeled 
configurations could potentially lead to troubleshooting 
a problem incorrectly, such as identifying a bad config- 
uration as a good one. To evaluate the sensitivity of our 
configuration decision trees to mislabeling, we started 
with a known, correct set of labeled configurations and 
their associated decision trees. We then chose a ran- 
dom percentage p of those configurations and mislabeled 
them, flipping their labels from good to bad and vice 
versa. From this set with mislabeled configurations, we 
again generated decision trees and compared them with 
the original trees generated using correct labels. 
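
The label-flipping step of this experiment can be sketched as follows. This is a minimal illustration under our own assumptions: configurations are represented as (settings, label) pairs, and the tree learning and tree comparison steps are omitted.

```python
import random


def flip_labels(configs, p, rng):
    """Return a copy of configs with a random fraction p of labels flipped.

    configs: list of (settings, label) pairs, label in {"good", "bad"}.
    rng: a random.Random instance, so that trials are reproducible.
    """
    n_flip = round(p * len(configs))
    flip_idx = set(rng.sample(range(len(configs)), n_flip))
    flipped = []
    for i, (settings, label) in enumerate(configs):
        if i in flip_idx:
            label = "bad" if label == "good" else "good"
        flipped.append((settings, label))
    return flipped
```

Repeating the call with differently seeded generators would give the independent trials that are averaged for each value of p.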

Figure 8 shows the results of this experiment on the 
configurations for three applications: VPN (CM), File 
Sharing and Xbox. The x-axis shows the percentage 
of mislabeling of configurations, and the y-axis shows 
the percentage of configurations incorrectly labeled in 
the decision tree based upon the mislabeled configura- 
tions. Each point represents the average across 100 tri- 
als. The VPN, File Sharing, and Xbox curves are sim- 
ilar and therefore difficult to distinguish. The VPN(x4) 
curve shows the effect of mislabeling for CM when the 
tree learning used four times as much data as from our 
testbed. 

The results indicate that the applications are fairly re- 
silient to mislabeling. While an insistence on no errors 
(0%) can only tolerate 2-4% mislabeling, allowing a 
1% error (i.e., returning an incorrect configuration fix 
for up to 1 out of 100 diagnoses) allows tolerating 13-17% 
mislabeling. When more than 20% of configurations 
are mislabeled, though, the resulting decision trees 
overfit substantially, resulting in a high error rate. We 
also found that the effect of mislabeling diminishes sig- 
nificantly with a larger number of data points. For the 
VPN(x4) experiment, the tree tolerates 9% mislabeling 
(0% error) and 26% mislabeling (1% error), making it 
considerably more tolerant than the tree with the smaller 
configuration set. 

Note that our methodology is not performing cross- 
validation on the data with training and testing sets. The 
reason is that we are not using the decision trees as clas- 
sifiers. In other words, NetPrints does not use decision 
trees to classify or predict whether a configuration is 
good or bad — all configurations from the client already 
have labels (“good” or “bad”) associated with them. The 
mislabeling experiment performs an extrinsic evaluation 
of the problem in terms of the utility of identifying an 
appropriate configuration mutation for a diagnosis in the 
face of incorrect labels. 


9.6.2 Reduced Diversity 


The configurations from our testbed experiments are 
roughly uniform in distribution in terms of the settings 
of the various parameters. In practice, the distribution is 
likely to be less diverse, with some settings much more 
prevalent than others (e.g., SPI might be disabled in 90% 
of configurations). In particular, the default configura- 
tion for a device, with an incorrect setting for a parame- 
ter, is likely to be prevalent, as is the resulting working 
configuration after correction. 

Does low diversity further change the sensitivity of 
the decision trees to mislabeling? For each of the VPN, 
File Sharing and Xbox applications, we chose two con- 
figurations representing a default bad configuration and 
a default good configuration. We then introduced dupli- 
cates of those defaults to create low diversity. We varied 
the percentage of identical configurations from 0% to 95%, 
learnt the decision tree, and measured the extent of mis- 





labeling similar to Section 9.6.1. For all of the applica- 
tions, the effect of mislabeling was the same as with the 
original distribution of configurations. 
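
One way to construct such a low-diversity data set is sketched below. The text does not say how the duplicates were mixed in, so the even split between the two defaults and the helper name are assumptions made for illustration.

```python
def add_default_duplicates(configs, default_bad, default_good, dup_fraction):
    """Pad configs with copies of two default configurations.

    After padding, the copies make up dup_fraction of the result.
    Assumption: copies are split evenly between the bad and good defaults.
    """
    if not 0 <= dup_fraction < 1:
        raise ValueError("dup_fraction must be in [0, 1)")
    # Solve d / (n + d) = dup_fraction for the number of duplicates d.
    n_dups = round(dup_fraction * len(configs) / (1 - dup_fraction))
    n_bad = n_dups // 2
    n_good = n_dups - n_bad
    return (list(configs)
            + [(dict(default_bad), "bad")] * n_bad
            + [(dict(default_good), "good")] * n_good)
```

For example, reaching 95% identical configurations from 100 distinct ones requires 1900 duplicates.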


10 Discussion 


We now discuss a few broad challenges for NetPrints. 


10.1 Reducing the Burden of Labeling 


As noted in Section 5.2, NetPrints enlists the help of 
users to perform labeling of configurations (and also of 
network traffic traces). NetPrints employs several sim- 
ple ideas to gather rich and accurate labeled data while 
minimizing the burden on users. 

The labeling of “bad” configurations happens implic- 
itly, as a by-product of a user invoking NetPrints for di- 
agnosis when experiencing an application failure. Thus, 
it is only for having the “good” configurations labeled 
that the user’s help must be enlisted explicitly. 

However, prompting the user to label each run of an 
application as “good” or “bad” would likely be oner- 
ous and perhaps also provoke deliberately dishonest be- 
haviour from an irritated user. So, in NetPrints, we only 
prompt each user for a small fraction of the application 
runs invoked by that user, with the expectation that, with 
a minimal burden placed on them, users would likely be 
honest while labeling. Given the participation of a large 
number of users, NetPrints is still able to accumulate a 
large volume of labeled configuration information, even 
while keeping the burden on any individual user low. 

Furthermore, even the occasional prompting of a user 
is modulated so as to yield useful data with high like- 
lihood. First, since the effective application of learn- 
ing would require a mix of both “good” and “bad” data, 
users are prompted more frequently (with the hope of 
obtaining more data points labeled as “good”) when the 
system is accumulating more “bad” data points because 
of users invoking NetPrints frequently to diagnose prob- 
lems. Second, a user is more likely to be prompted when 
there has been a recent local configuration change. This 
policy increases the likelihood of novel information be- 
ing fed into the learning process. 
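
A prompting policy with these two properties could look like the sketch below. The base rate, the multipliers, and the cap are invented for illustration; the text does not give the actual values or formula used by NetPrints.

```python
def prompt_probability(n_good, n_bad, recent_config_change,
                       base=0.05, cap=0.5):
    """Probability of prompting the user to label an application run.

    All constants are hypothetical. The probability grows with the
    share of "bad" data points and doubles after a recent local
    configuration change.
    """
    total = n_good + n_bad
    bad_share = n_bad / total if total else 0.5
    p = base * (1 + 2 * bad_share)   # more "bad" data: ask more often
    if recent_config_change:
        p *= 2                       # novel configurations are more useful
    return min(p, cap)
```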


10.2 Preserving Privacy 


Privacy is a key concern for NetPrints. Simply excluding 
privacy-sensitive configuration parameters such as user- 
names and passwords from the purview of NetPrints is 
not sufficient. Even the ability to tie back to the origin 
host (identified, say, by its IP address) configuration data 
uploaded to the NetPrints server could be problematic. 
For instance, knowledge of misconfigurations on a host 
could leave it vulnerable to attacks. 

In ongoing work, we are developing a distributed 
aggregation system aimed at balancing two conflicting 
goals: enabling nodes to contribute data anonymously 
while still enforcing tight bounds on the ability of malicious 
nodes to pollute the aggregated data. Thus, if a 
majority of nodes is honest, the aggregated data would 
be mostly accurate. While the details of this aggrega- 
tion system are out of the scope of the present paper, we 
believe that NetPrints could directly use such a system. 


10.3 Bootstrapping NetPrints 


A participatory system such as NetPrints faces interest- 
ing challenges in bootstrapping its deployment. There is 
a chicken-and-egg problem in that users are unlikely to 
participate unless the system is perceived as being valu- 
able in terms of its ability to diagnose problems, which 
in turn depends on the contribution of data by the partic- 
ipating users’ machines. Even if this dilemma were re- 
solved, there is still the challenge that users might resort 
to greedy behaviour, installing and running NetPrints 
only when they need to diagnose a problem and turn- 
ing it off at other times, thereby starving the system of 
the data it needs to perform diagnoses effectively. 

One could devise incentive mechanisms to encourage 
user participation. A complementary mechanism, which 
we are pursuing, is to bootstrap NetPrints using infor- 
mation learned via experiments in a laboratory testbed. 
This is similar to the methodology used for the evalua- 
tion presented in Section 9. While the richness of the 
testbed data would have a direct bearing on NetPrints’ 
learning and hence its ability to diagnose problems, such 
an approach could help bootstrap NetPrints to the point 
where users perceive enough value to start participating. 


11 Conclusion 


We have described the design and implementation of 
NetPrints, a system to automatically troubleshoot home 
networking problems caused by misconfigurations. Net- 
Prints uses decision tree-based learning on labeled con- 
figuration information and traffic features from a popula- 
tion of clients to build a shared repository of knowledge 
on a per-application basis. We report experimental re- 
sults for a few applications in a laboratory testbed and a 
small-scale deployment. Our ongoing work focuses on 
scaling up the deployment and addressing privacy issues. 
Abstract 


Reducing the energy consumption of PCs is becoming in- 
creasingly important with rising energy costs and environmen- 
tal concerns. Sleep states such as S3 (suspend to RAM) save 
energy, but are often not appropriate because ongoing network- 
ing tasks, such as accepting remote desktop logins or perform- 
ing background file transfers, must be supported. In this paper 
we present Somniloquy, an architecture that augments network 
interfaces to allow PCs in S3 to be responsive to network traf- 
fic. We show that many applications, such as remote desktop 
and VoIP, can be supported without application-specific code 
in the augmented network interface by using application-level 
wakeup triggers. A further class of applications, such as in- 
stant messaging and peer-to-peer file sharing, can be supported 
with modest processing and memory resources in the network 
interface. Experiments using our prototype Somniloquy implementation, 
a USB-based network interface, demonstrate energy 
savings of 60% to 80% in most commonly occurring scenarios. 
This translates to significant cost savings for PC users. 


1 Introduction 


Many personal computers (PCs) remain switched on 
for much or all of the time, even when a user is not 
present [23], despite the existence of low power modes, 
such as sleep or suspend-to-RAM (ACPI state S3) and 
hibernate (ACPI state S4) [1]. The resulting electricity 
usage wastes money and has a negative impact on the 
environment. 

PCs are left on for a variety of reasons (see Section 2), 
including ensuring remote access to local files, main- 
taining the reachability of users via incoming email, in- 
stant messaging (IM) or voice-over-IP (VoIP) clients, file 
sharing and content distribution, and so on. Unfortu- 
nately, these are all incompatible with current power- 
saving schemes such as S3 and S4, in which the PC does 
not respond to remote network events. Existing solutions 
for sleep-mode responsiveness such as Wake-On-LAN 
(WoL) [18] have not proven successful “in the wild” for 
a number of reasons, such as the need to modify applica- 


tion servers or configure network hardware. A few initial 
proposals suggest the use of network proxies [4, 7, 11] 
to perform lightweight protocol functionality, such as re- 
sponding to ARPs. However, such a system too requires 
significant modifications to the network infrastructure, 
and to the best of our knowledge such a prototype has 
not been described in published form (see Section 6 for 
a full discussion). 

In this paper, we present a system, called Somniloquy¹, 
that supports continuous operation of many 
network-facing applications, even while a PC is asleep. 
Somniloquy provides functionality that is not present in 
existing wake-up systems. In particular, it allows a PC to 
sleep while continuing to run some applications, such as 
BitTorrent and large web downloads, in the background. 
In existing systems, these applications would stop when 
the PC sleeps. 

Somniloquy achieves the above functionality by em- 
bedding a low power secondary processor in the PC’s 
network interface. This processor runs an embedded op- 
erating system and impersonates the sleeping PC to other 
hosts on the network. Many applications can be sup- 
ported, either with or without application-specific code 
“stubs” on the secondary processor. Applications sim- 
ply requiring the PC to be woken up on an event can be 
supported without stubs, while other applications require 
stubs but in return support greater levels of functionality 
during the sleep state. 

We have prototyped Somniloquy using a USB-based 
low power network interface. Our system works for 
desktops and laptops, over wired and wireless networks, 
and is incrementally deployable on systems with an 
existing network interface. It does not require any 
changes to the operating system, to network hardware 
(e.g. routers), or to remote application servers. We have 
implemented support for applications including remote 
desktop access, SSH, telnet, VoIP, IM, web downloads 


¹somniloquy: the act or habit of talking in one’s sleep. 
and BitTorrent. Our system can also be extended to support 
other applications. We have evaluated Somniloquy 
in various settings, and in our testbed (Section 5) a PC in 
Somniloquy mode consumes 11x to 24x less power than 
a PC in idle state. For commonly occurring scenarios this 
translates to energy savings of 60% to 80%. 

We make the following contributions in this paper: 


• We present a new architecture to significantly reduce 
the energy consumption of a PC while maintaining 
network presence. This is accomplished 
without changes in the network infrastructure. 

• We show that several applications (BitTorrent, 
web downloads, IM, remote desktop, etc.) can 
consume much less energy. This is achieved without 
modifying the remote application servers. 

• We present and empirically validate a model to predict 
the energy savings of Somniloquy for various 
applications. 

• We demonstrate the feasibility of Somniloquy via a 
prototype using commodity hardware. This prototype 
is incrementally deployable, and saves significant 
energy in a number of scenarios. 


2 Motivation 


Prior studies have shown that users often leave 
their computers powered on, even when they are largely 
idle [4]. A study by Roberson et al. [23] shows that in 
offices, 67% of desktop PCs remain powered on outside 
work hours, and only 4% use sleep mode. In home environments, 
Roth et al. [24] show that the average residential 
computer is on 34% of the time, but is not being actively 
used for more than half the time. 

To uncover the reasons why people do not use sleep 
mode, we conducted an informal survey. We passed it 
among our contacts who in turn circulated it further. We 
had 107 respondents from various parts of the world, of 
which 58 worked in the IT sector. 30% of the respon- 
dents left at least one machine at home on all of the time, 
and 75% of the respondents left at least one work ma- 
chine on even when no one was using it. 

Among the people who left their home machine pow- 
ered on, 29% did so for remote access, 45% for quick 
availability and 57% for applications running in the back- 
ground, of which file sharing/downloading (40%) and 
IM/e-mail (37%) were most popular. In the office envi- 
ronment, 52% of respondents left their machines on for 
remote access, and 35% did so to support applications 
running in the background, of which e-mail and IM were 
most popular (47%). 

Although this survey should not be regarded as repre- 
sentative of all users, and is not statistically significant, it 
does highlight two important points. First, a number of 
Figure 1: Somniloquy augments the PC network interface 
with a low power secondary processor that runs an 
embedded OS and networking stack, network port filters 
and lightweight versions of certain applications (stubs). 
Shading indicates elements introduced by Somniloquy. 
PCs don’t go to sleep even when they are unused. Second, 
significant energy savings can be achieved if only a 
few applications (remote reachability, file sharing, file 
downloads, instant messaging, e-mail) can be handled 
when the PC is asleep. 


3 The Somniloquy Architecture 


Our primary aims during the development of Somnilo- 
quy were: 


• to allow an unattended PC to be in low power 
S3 state while still being available and active for 
network-facing applications as if the PC were fully 
on; 

• to do so without changing the user experience of the 
PC or requiring modification to the network infrastructure 
or remote application servers. 


We accomplish these goals by augmenting the PC’s 
network interface hardware with an always-on, low 
power embedded CPU, as shown in Figure 1. This secondary 
processor has a relatively small amount of memory 
and flash storage², which consumes much less power 
than if it were sharing the larger disk and memory of the 
host processor. It runs an embedded operating system 
with a full TCP/IP networking stack, such as embedded 
Linux or Windows CE. The flash storage is used as a 
temporary buffer to store data before the data is trans- 
ferred in a larger chunk to the PC. A larger flash on the 
secondary processor allows the PC to sleep longer (Section 3.2). 
This architecture has a couple of useful properties. 
First, it does not require any changes to the host 
operating system, and second, it can be incrementally de- 
ployed on existing PCs using a peripheral network inter- 
face (Section 4). 
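
The relation between flash size and sleep duration can be illustrated with a back-of-the-envelope calculation. The sketch below assumes that the whole flash is available as a download buffer and that the download rate is constant; the 1 MB/s rate is a hypothetical value, while the 2 GB capacity is that of the prototype.

```python
def max_sleep_seconds(flash_bytes, download_rate_bytes_per_s):
    """Upper bound on how long the host can sleep before the buffer fills.

    Simplifying assumptions: the entire flash is used as a buffer and
    the download proceeds at a constant rate.
    """
    if download_rate_bytes_per_s <= 0:
        raise ValueError("download rate must be positive")
    return flash_bytes / download_rate_bytes_per_s


# 2 GB of flash filled at a hypothetical 1 MB/s.
sleep_s = max_sleep_seconds(2 * 10**9, 10**6)
```

Under these assumptions the host could sleep for 2000 seconds, roughly 33 minutes, before it must wake to drain the buffer; doubling the flash doubles that bound.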


²Our prototype had 64 MB DRAM and 2 GB of flash. 
The software components of Somniloquy and their interactions 
are illustrated in Figure 2. The high-level operation 
of Somniloquy is as follows: When the host PC is 
powered on, the secondary processor does nothing; the 
network stack on the host processor communicates di- 
rectly with the network interface hardware. When the PC 
initiates sleep, the Somniloquy daemon on the host pro- 
cessor captures the sleep event, and transfers the network 
state to the secondary processor. This state includes the 
ARP table entries, IP address, DHCP lease details, and 
associated SSID for wireless networks, i.e., MAC- and IP-layer 
information. It also includes details of what events 
the host should be woken on, and application-specific de- 
tails such as ongoing file downloads that should continue 
during sleep. Following the transfer of this information 
to the secondary processor, the host PC enters sleep. 
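
The state handed over at sleep time can be pictured as a single record. The sketch below is our own illustration: the class name, field names, and JSON encoding are assumptions, since the text lists what is transferred but not how it is represented.

```python
import json
from dataclasses import dataclass, field, asdict
from typing import Optional


@dataclass
class SleepHandoff:
    """Hypothetical container for the state sent to the secondary processor."""
    ip_address: str
    arp_table: dict                     # IP address -> MAC address
    dhcp_lease: dict                    # lease details, format assumed
    ssid: Optional[str] = None          # only set for wireless networks
    wake_triggers: list = field(default_factory=list)
    app_state: dict = field(default_factory=dict)  # e.g. ongoing downloads

    def to_json(self) -> str:
        return json.dumps(asdict(self))

    @classmethod
    def from_json(cls, text: str) -> "SleepHandoff":
        return cls(**json.loads(text))
```

A text encoding such as JSON is used here only because it is easy to inspect; any serialization understood by both processors would serve.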

Although the host processor is asleep, power to the 
network interface and the secondary processor is main- 
tained [1]. To maintain transparent reachability to the 
host while it is asleep, the secondary processor imper- 
sonates the host by using the same MAC and IP ad- 
dresses, host name, DHCP details, and for wireless, the 
same SSID. It also handles traffic at the link and network 
layers, such as ARP requests and pings — thereby main- 
taining basic presence on the network. New incoming 
connection requests for the host processor are now re- 
ceived and handled by the network stack running on the 
secondary processor. In this way the PC’s transition into 
sleep is transparent to remote hosts on the network. 

To ensure that the host PC is reachable by various ap- 
plications, a process on the secondary processor mon- 
itors incoming packets. This process watches for pat- 
terns, such as requests on specific port numbers, which 
should trigger wake-up of the host processor. Although 
this simple architecture [4, 7, 11] supports several applications 
with minimal complexity, Somniloquy can get 
much greater energy savings for some applications by 
not waking up the host processor for simple tasks, for 
example, to send instant messenger presence updates. To 
perform these tasks on the secondary processor, we re- 
quire the application writer to add a small amount of 
application-specific code (“stubs”) on the host and secondary 
processor. In the rest of this section we describe 
in more detail how we handle various applications — with 
and without application stubs. 


3.1 Somniloquy without Application Stubs 


The Somniloquy daemon on the host processor specifies 
packet filters, i.e., patterns on incoming packets, on 
which the secondary processor should wake up the host 
processor from sleep state. The Somniloquy daemon cre- 
ates filters at various layers of the network stack. At the 
link layer and network layer, the secondary processor can 
Figure 2: Somniloquy software components on the host 
PC and the secondary processor, and their interactions. 


be told to wake the computer when it detects a particular 
packet, analogously to the magic packets used by Wake 
on LAN, though not requiring the MAC address to be 
known by the remote host (see further discussion in Sec- 
tion 6). Trigger conditions at the transport layer may also 
be specified, for example, wake on TCP port 23 for telnet 
requests. Similarly, Somniloquy also supports wake-ups 
on patterns in the application payload. 
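
As an illustration of transport-layer and payload triggers, the sketch below matches an incoming packet against a list of wake-up filters. The class and function names are hypothetical, and packets are reduced to a protocol, a destination port, and a payload; the prototype's actual filter format is not described here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class WakeFilter:
    """Hypothetical wake-up filter; a field left as None matches anything."""
    protocol: Optional[str] = None          # "tcp" or "udp"
    dst_port: Optional[int] = None
    payload_pattern: Optional[bytes] = None

    def matches(self, protocol, dst_port, payload=b""):
        if self.protocol is not None and protocol != self.protocol:
            return False
        if self.dst_port is not None and dst_port != self.dst_port:
            return False
        if (self.payload_pattern is not None
                and self.payload_pattern not in payload):
            return False
        return True


def should_wake(filters, protocol, dst_port, payload=b""):
    """Wake the host if any registered filter matches the packet."""
    return any(f.matches(protocol, dst_port, payload) for f in filters)


# Wake on telnet requests (TCP port 23), as in the example above.
filters = [WakeFilter(protocol="tcp", dst_port=23)]
```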

Although the host PC will wake up within a few sec- 
onds, it will not receive the packet(s) that triggered the 
wake-up. One way to solve this problem is to buffer the 
packet on the secondary processor and replay it on the 
network stack of the host processor once it has woken 
up. However, since the time to wake up is just a few sec- 
onds, most sources can be relied upon to retry the con- 
nection request. For example, any protocol using TCP 
as the transport layer will automatically retransmit the 
initial SYN packet. Even UDP-based applications that 
are designed for Internet use are designed to cope with 
packet loss using automatic retransmissions. 

This simple packet filter based approach to trigger- 
ing wake-ups has the advantage that application-specific 
code does not need to be executed on the secondary pro- 
cessor. Nonetheless, it is sufficient to support many ap- 
plications that get triggered on remote connection re- 
quests, such as remote file access, remote desktop access, 
telnet and ssh requests to name a few. 


3.2 Application-specific Extensions 


Several applications maintain active state on the PC even 
when it is idle, and hence prevent a PC from going to 
sleep. For example, a movie download client on a home 
PC (e.g. from Netflix) will require the host PC to be 
awake for a few hours while downloading the movie. An 
instant messenger (IM) client will require the PC to be 
on in order for the user to stay “online” (reachable) to 
their contacts. 

Somniloquy provides a way for these applications 
to consume significantly less power. By performing 
lightweight operations on the secondary processor, it 
can opportunistically put the host processor to sleep. 
For example, the secondary processor can send and re- 
ceive presence updates to/from the IM server while the 
host processor is asleep. During a large download, the 
secondary processor can download portions of the file, 
putting the host processor to sleep in the meantime. 

The key to supporting these applications is the use 
of stubs that run on the host and the secondary proces- 
sor. We have implemented stubs for three popular ap- 
plications — IM (MSN, AOL, ICQ), BitTorrent, and web 
download. Here, we will describe the general guidelines 
for writing these stubs, and describe the specific imple- 
mentations for the three applications in Section 4. 

Writing application stubs: When designing an appli- 
cation stub, the first step is to understand the subset of the 
application’s functionality that needs to run when the PC 
is asleep. This is implemented as a stub on the secondary 
processor. For example, for an IM stub, the functionality 
to send and receive presence updates is essential to main- 
tain IM reachability. However, the stub need not include 
any UI-related code — such as opening a chat window. 

We note that it is not feasible for the stub to reuse the 
entire original application code from the host PC. The 
application code might depend on drivers (display, disk, 
etc.) that are absent on the secondary processor. Further- 
more, running the entire application might overload the 
secondary processor. Therefore, only the essential com- 
ponents of the application are implemented as part of the 
application stub. 

Another step in designing application stubs is to de- 
cide when to wake up the host processor. Triggers can 
be user-defined, for example waking up on an incoming 
call from a specific IM contact. Triggers may also occur 
when the secondary processor’s resources are insufficient, 
for example when the flash is full or more CPU resources 
are needed. In all of these cases, the stub wakes 
up the host processor. 

To interface with the application on the host PC and 
the Somniloquy daemon, the application stub needs to 
have a component on the host processor. This compo- 
nent registers two callback functions with the Somnilo- 
quy daemon — one that is called just before the PC goes 
to sleep and the other just after it has woken up. The 
first function transfers the application state to the stub on 
the secondary processor, and also sets the trigger conditions 
on which to wake the host processor. These values 
depend on the application being handled by the stub. 
The second callback function, which is called when the 
host resumes from sleep, checks the event that caused 
the wakeup — whether it was caused by a trigger con- 
dition on the secondary processor or due to user activ- 
ity. It handles these events differently. If the wakeup 
was caused by user activity, the stub transfers state from 
the secondary processor, and disables it. However, if the 
wakeup was caused by a trigger condition on the sec- 
ondary processor, the application stub handles it as de- 
fined by the user. For example, for an incoming VoIP 
call, the stub engages the incoming call functionality of 
the VoIP application. 
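
The two-callback interface described above can be sketched as follows. Both classes are hypothetical stand-ins written for illustration: the real daemon's API is not given in the text, and the stub only records what it would do at each step.

```python
class DaemonSketch:
    """Stand-in for the Somniloquy daemon's callback registry (assumed API)."""

    def __init__(self):
        self._callbacks = []

    def register(self, before_sleep, after_wake):
        self._callbacks.append((before_sleep, after_wake))

    def enter_sleep(self):
        # Called just before the PC goes to sleep.
        for before_sleep, _ in self._callbacks:
            before_sleep()

    def resume(self, cause):
        # Called just after the PC has woken up; cause is "user" or a
        # trigger name reported by the secondary processor.
        for _, after_wake in self._callbacks:
            after_wake(cause)


class HostSideStub:
    """Host-side component of a hypothetical application stub."""

    def __init__(self):
        self.log = []

    def before_sleep(self):
        self.log.append("transfer state to secondary; set wake triggers")

    def after_wake(self, cause):
        if cause == "user":
            self.log.append("pull state back; disable secondary stub")
        else:
            self.log.append("handle trigger: " + cause)
```

A stub would register its callbacks once, for example `daemon.register(stub.before_sleep, stub.after_wake)`, after which the daemon invokes them around each sleep and wake transition.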


Having determined what functionality needs to be sup- 
ported by the application stub and host-based callbacks, 
and what state must pass between them, the final step is 
to implement this. We have used two manual approaches 
to doing this. For the download stub, we built all the 
functionality ourselves based on detailed knowledge of 
the application protocols, and for the BitTorrent and IM 
stubs, we trimmed down existing application code to re- 
duce memory and CPU footprint. An alternative could 
be to automatically learn protocol behavior to build these 
application stubs. However, we believe that this is an 
extremely difficult problem. There are parts of the ap- 
plication that are difficult to infer, and any inaccuracy in 
the application stub will make it unusable. For exam- 
ple, knowledge of how BitTorrent hashes the file blocks 
is necessary for the stub to successfully share a file with 
peers. We are unaware of any automatic tool that can 
learn such application behavior. Therefore, we believe 
that the best (although perhaps not the most elegant) 
approach to building these stubs is to modify applica- 
tion source code and remove functionality that is not re- 
quired by the secondary processor. In the future, with 
a greater incentive to save energy, we expect that appli-
cation developers will compete on energy consumption,
and hence provide stubs for their applications using the 
guidelines described in this section. 


We realize that partial application stubs might be cre- 
ated using tools such as the Generic Application-Level 
Protocol Analyzer [6] and Discoverer [8], which auto- 
matically learn the behavior and message formats for a 
range of protocols. As part of future work, we plan to 
explore how the knowledge of the protocol can be aug- 
mented with application-specific behavior to ease the de- 
velopment of application stubs. 

When to use application stubs? Not all applications 
are conducive to low-power operation via application 
stubs. A CPU intensive application, such as a compi- 
lation job, will be very slow on the secondary processor 
since it has a less powerful CPU and low memory. Simi- 
larly, an I/O intensive application, such as a disk indexer, 
will need to read the disk very often and will therefore
need the PC to be awake. Download and file sharing applications
are an interesting exception, because portions
of a file can be transferred by the secondary processor 
whilst the host sleeps. We will discuss this approach in 
more detail in Section 4.4. 

Even for an application stub that saves energy for a 
given application, it is not always useful to offload the ap- 
plication to the secondary processor when the host PC is 
going to sleep. Several other applications may also want 
to run their application stubs on the secondary processor. 
This might overload the CPU of the (weaker, low power) 
secondary processor. In this case, it might be beneficial 
to keep the host PC awake. 

One way to solve this problem is to modify the Somniloquy
daemon to predict the CPU utilization of the
stubs for all applications that are willing to be offloaded 
to the secondary processor. However, making this pre- 
diction is extremely difficult. There might be little cor- 
relation between the CPU utilization of the application 
on the host PC, and the stub on the secondary proces- 
sor, because of different processor architectures, and 
varying application demands. Instead, we take a sys- 
tems approach. We monitor the CPU utilization of the 
secondary processor; if it remains at more than 90% 
continuously (>30 seconds), we wake up the PC, and resume
all applications on the host processor. If the CPU
utilization of these applications decreases by more than 
10% on the host processor, we repeat the same procedure 
— offload to the secondary processor and stay there if 
CPU utilization is less than 90%. In our Somniloquy de- 
ployment the need to move applications arose when run- 
ning multiple application stubs on the secondary proces- 
sor, such as two concurrent 8 Mbps web downloads and 
two concurrent BitTorrent downloads of Section 5.3.2. 
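
The overload rule above (utilization above 90% for more than 30 seconds) can be sketched as a small check over timestamped samples. The function name, the sample format, and the idea of feeding it periodic readings are assumptions made for illustration; only the two thresholds come from the text.

```python
from typing import Iterable, Optional, Tuple


def should_wake_host(
    samples: Iterable[Tuple[float, float]],
    threshold: float = 90.0,
    window_s: float = 30.0,
) -> bool:
    """Return True if CPU utilization stays above `threshold` percent
    for more than `window_s` seconds without interruption.

    `samples` is a time-ordered iterable of (timestamp_s, cpu_percent).
    """
    overload_start: Optional[float] = None
    for timestamp, cpu in samples:
        if cpu > threshold:
            if overload_start is None:
                overload_start = timestamp  # start of the overloaded run
            if timestamp - overload_start > window_s:
                return True
        else:
            overload_start = None  # any dip below the threshold resets the run
    return False
```

Requiring a continuous run, rather than a single high reading, keeps short bursts on the secondary processor from waking the host.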

Incremental Deployment: We realize that Somnil- 
oquy may never be universally deployed, and that get- 
ting software vendors to try for incremental deployment 
requires a low-effort mechanism to ensure that their 
Somniloquy-enhanced software is compatible with ma- 
chines and platforms that do not have Somniloquy sup- 
port. The Somniloquy daemon queries the OS to de- 
termine the presence of a secondary processor, and the 
supported application stubs. Applications then need to 
query the Somniloquy daemon, and invoke the applica- 
tion stubs only if the OS supports Somniloquy, and the 
corresponding stubs are implemented on the secondary 
processor. 


3.3 Quantifying Energy Savings


The amount of energy saved through adoption of Som- 
niloquy is quite easy to predict; it depends on the relative 
power consumption of the awake and sleep states, and 
the proportion of time that a machine can be kept asleep 


when it would previously have been awake. For applica- 
tions without stubs, this proportion is largely dependent 
on the actions of a remote user - how frequently a re- 
mote ssh session is initiated for example, and for how 
long. On the other hand, for applications with stubs the 
secondary processor may regularly wake up the host to 
perform some task or other. We quantify the energy sav- 
ings for an application with different wake-up intervals 
in Section 5.4.4. 

More formally, suppose the host is woken up once every
$T_{sleep}$ seconds, whereupon it stays awake for $T_{awake}$
seconds. $T_{awake}$ includes the time it takes to transfer
data between the PC and the secondary processor. Also
assume that $d$ is the sum of the time to wake up the host plus
the time to transition to sleep. Suppose:

• $P_a$ is the power consumption of the PC when it is
awake (in W),

• $P_s$ is the power consumed in sleep mode (in W), and

• $P_e$ is the power consumed by the secondary (embedded)
processor (in W).

The energy ($E$) consumed during Somniloquy operation
is given by:

$$
\begin{aligned}
E_{Somniloquy} &= E_{\text{PC in Sleep Mode}} + E_{\text{PC in Awake Mode}} + E_{\text{Secondary Processor}} \\
&= T_{sleep} \cdot P_s + (T_{awake} + d) \cdot P_a + (T_{awake} + d + T_{sleep}) \cdot P_e \ \text{Joules}
\end{aligned}
$$

In the absence of Somniloquy, the amount of energy
consumed by the host PC in the same time is
$E_{host} = P_a \cdot (T_{awake} + T_{sleep})$ Joules. Therefore, the ratio of
energy consumed by Somniloquy compared to the host
PC being always on is given by:

$$
\frac{E_{Somniloquy}}{E_{host}} = \frac{T_{sleep} \cdot (P_e + P_s) + (T_{awake} + d) \cdot (P_a + P_e)}{P_a \cdot (T_{awake} + T_{sleep})}
$$

Typically, as we show in Section 5, $P_e$ and $P_s$ are two
orders of magnitude less than $P_a$ for a desktop computer,
and $d$ is around 10 seconds (to wake up the host, and put
it back to sleep). Therefore, for most energy savings,
we would want $T_{awake}$ to be much less than $T_{sleep}$, i.e.
if $T_{awake} \ll T_{sleep}$, then the ratio ($E_{Somniloquy}/E_{host}$)
is approximately $(P_e + P_s)/P_a$. We will present the
approximate energy savings for different applications in
Section 4.
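
As a quick numerical check of the energy model in this section, the ratio can be evaluated directly. The sketch below is illustrative only: the function name is ours, and the sample powers and durations are hypothetical round numbers, not measurements from this paper.

```python
def energy_ratio(t_sleep, t_awake, d, p_a, p_s, p_e):
    """Energy used with Somniloquy divided by energy used by an always-on host.

    t_sleep, t_awake, d are in seconds; p_a, p_s, p_e are in watts
    (awake PC, sleeping PC, and secondary processor respectively).
    """
    e_somniloquy = (
        t_sleep * p_s                      # PC asleep
        + (t_awake + d) * p_a              # PC awake, plus transition time d
        + (t_awake + d + t_sleep) * p_e    # secondary processor always on
    )
    e_host = p_a * (t_awake + t_sleep)     # always-on baseline from the text
    return e_somniloquy / e_host


# Hypothetical example: wake for 60 s once an hour, with d = 10 s.
ratio = energy_ratio(t_sleep=3600, t_awake=60, d=10, p_a=100.0, p_s=3.0, p_e=1.0)

# As t_sleep grows, the ratio approaches (p_e + p_s) / p_a.
limit = (1.0 + 3.0) / 100.0
```

With these assumed values the ratio stays somewhat above the limiting value because the awake period and the transition time `d` are charged at the full awake power.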

Of course, Somniloquy could save more energy by dis- 
abling the secondary processor when the PC is awake. 
This would require the PC to enable the secondary pro- 
cessor before going to sleep, and disable it when the PC 
has woken up. We were unable to fully implement this 
functionality in our prototype, but we expect this to be a 
minor fix in a production system. 


NSDI ’09: 6th USENIX Symposium on Networked Systems Design and Implementation 




3.4 Discussion 


Security: A common requirement of corporate IT de- 
partments is that all PCs should be up to date with the 
latest OS and application patches. Somniloquy can en- 
sure that this constraint is met even when PCs are asleep. 
This is achieved using a port-based trigger to wake up the 
host PC when the SMS (Systems Management Server) 
contacts the host PC to install updates. 

Somniloquy ensures that the secondary processor is
secure by patching its OS whenever security updates
become available. Also, it prevents attackers from replacing
the secondary processor by requiring that it be
physically part of the PC (as part of the network interface).
In some cases, however, the functionality that
Somniloquy provides could be misused to conduct at- 
tacks that spuriously wake up the PC and waste energy. 
This kind of denial-of-service attack would be particu- 
larly effective for mobile devices where a drained bat- 
tery might result. One way to address this issue is to 
disable port triggers, and instead exclusively use appli- 
cation stubs which ensure that only authenticated remote 
hosts are allowed to trigger wakeup. 

Another concern is that application stubs, and hence
the use of extra code, increase the PC’s attack surface.
To mitigate this risk we use two
techniques. First, the secondary processor only listens
on ports that have been opened by applications on the 
host PC. Second, we require the PC and the secondary 
processor to be on the same administrative domain. 

We also note that modern processors have additional
security features built in, for example an execute-disable
bit, used by some applications to prevent the execution of
arbitrary code following a buffer overflow. We realize
that a low power processor may not currently support this 
advanced functionality, although we expect that in the 
future low-power chips will also be available with these 
features. 

Alternative Design: With the increasing prevalence 
of multi-core PCs, one idea to alleviate the need for the 
additional secondary processor introduced by Somnilo- 
quy would be to use one of the cores of the host CPU in- 
stead. Running just one core at the lowest possible clock 
frequency would minimize energy consumption and ob- 
viate the need for a separate low power processor in the 
NIC. 

However, it turns out that such an approach is not use- 
ful without significant modification to today’s PC archi- 
tecture. Our measurements (see Section 5.1) show that 
the power consumption of a multi-core PC with only one 
core active, running at the lowest permissible clock speed 
is still approximately 50 times that of our low power sec- 
ondary processor, even with all other peripherals in their 
lowest power modes (e.g. disk spun down). This is because
of the lack of truly fine-grained power control of
PC components such as the Northbridge, Southbridge, 
memory buses, parts of the storage hierarchy and various 
peripherals. Even if fine-grained control were available, 
the base power consumption of individual components 
(NIC, hard drive) is significant (see Table 2). One way 
to reduce this base power draw would be to have a sep- 
arate and relatively simple core with a small amount of 
associated memory running from a separate power do- 
main so that it can function without powering on other 
components. Such an architecture is very similar to Som- 
niloquy, and most of our design principles can easily be 
adopted. 


4 Prototype Implementation 


We have prototyped Somniloquy using gumstix, a low
power modular embedded processor platform manufactured
by Gumstix Inc that supports a wide variety of
peripherals.


4.1 Hardware and Software Overview 


An important goal when prototyping Somniloquy was to 
have it work with existing unmodified desktops and lap- 
tops, and for both wired and wireless networks. Further- 
more, we required the platform to be low power, have 
a small form factor, and be well supported for develop- 
ment. The gumstix platform served all these design re- 
quirements well. The specific components we use for 
Somniloquy include a connex-200xm processor board, 
an etherstix network interface card (NIC) (for wired Eth- 
ernet), a wifistix NIC (for Wi-Fi), and a thumbstix com- 
bined USB interface/breakout board. The connex-200xm 
employs a low power 200 MHz PXA255 XScale pro- 
cessor, with 16 MB of non-volatile flash and 64 MB of 
RAM. The etherstix provides a 10/100BaseT wired Eth- 
ernet interface plus an SD memory slot to which we have 
attached a 2GB SD card. The thumbstix provides a USB 
connector, serial connections and general purpose input 
and output (GPIO) connections from the XScale. 

To enable Somniloquy we needed mechanisms to 
wake-up the host PC, and also to detect its state (awake 
or in S3). To achieve this we added a custom de- 
signed circuit board that incorporates a single chip — the 
FT232RL from FTDI. The FT232RL is a USB-to-Serial 
converter chip supporting functionality such as sending 
a resume signal to the host and detecting the state of the 
host, both over the USB bus. This board is attached to 
the computer via a second USB port and to the thumb- 
stix module (and thence to the XScale processor) via a 
two-wire serial (RS232) interface plus two GPIO lines. 
One GPIO line is connected to the FT232RL’s ‘ring indicator’
input to wake up the computer. The second GPIO
line is connected to the FT232RL’s ‘sleep’ output which
can be polled by the gumstix to detect whether the host
PC is active or in S3.

Figure 3: Block diagram of the Somniloquy prototype
system - Wired-1NIC version. The figure shows various
components of the gumstix and the USB interfaces to the
host laptop.


As mentioned above (and shown in Figure 3), the com- 
puter is connected to the secondary processor via two 
USB connections. One of these provides power and two- 
way communications between the two processors. It is 
configured to appear as a point-to-point network interface
(“USBNet”), over which the gumstix and the host
computer communicate using TCP/IP. The second USB
interface provides sleep and wake-up signaling, and a serial
port for debugging purposes. The use of two USB
interfaces is not a fundamental requirement; it is simply
for ease of prototyping.


Since we use standard USB ports for interfacing with 
the host and for sleep signaling, our prototype works on 
any recent desktop or laptop that supports USB. We run 
an embedded distribution of Linux on the gumstix that 
supports a full TCP/IP stack, DHCP, configurable routing 
tables, a configurable firewall, SSH and serial port com- 
munication. This provides a flexible prototyping plat- 
form for Somniloquy with very low power operation. 


We have implemented the Somniloquy host software 
on Windows Vista. The Somniloquy daemon detects 
transition to S3 sleep state, and before this is allowed 
to occur we transfer the network state (MAC address, IP 
address, and in the case of the wireless prototype, the 
SSID of the AP) and other information about the wakeup 
triggers as discussed in Section 3.

Figure 4: Photograph of the gumstix-based Somniloquy
prototype - Wired-1NIC version.


4.2 Three different prototypes 


We have prototyped three different Somniloquy designs 
to explore different aspects of operation. The first uses 
the gumstix as an augmented Ethernet interface, as de- 
scribed in Section 3. However, in our prototype this has 
some performance limitations so we have also imple- 
mented a second design which uses the gumstix in co- 
operation with an existing high-speed Ethernet interface. 
Finally, we have a Wi-Fi version. All three prototypes 
are described in further detail below: 

Augmented Network Interface: We call this implementation
the Wired-1NIC version. The architecture is
shown in Figure 3, with a photograph of the prototype
shown in Figure 4. In this prototype, we disable the NIC
of the host, and configure the PC to use the USBNet interface
(USB connection between the gumstix and the
host) as its only NIC. The gumstix is connected to the
network using its Ethernet connection. To enable the host
PC to be on the network, we set up a transparent layer-2
software bridge between the USBNet interface to the host
and the Ethernet interface of the gumstix. This bridge is
active when the host is awake. When the host transitions
to sleep, the gumstix disables the bridge, and resets the
MAC address of its Ethernet interface to that of the USBNet
interface of the host. The gumstix thus appears to
the rest of the network as the host itself, since it has the
same network parameters (IP, MAC address). When the
host wakes up, the gumstix resets its MAC address to its
original value and starts bridging traffic to the host again.

Although our Wired-1NIC prototype hardware supports
a 100 Mbps Ethernet interface, we are limited to a
throughput of 5 Mbps due to the bandwidth supported by
the USBNet interface driver. There is also a slight overhead
of bridging traffic on the gumstix. Although this
limits bandwidth to the host significantly in our prototype,
we note that in a final integrated version, this overhead
of bridging can be avoided by allowing both the
host and the low power secondary processor to access
the NIC directly.

Using Existing Network Interface: Somniloquy can 
coexist with an existing NIC. On such systems, the over- 
head of bridging is avoided by using the existing Ether- 
net interface on the host PC for data transfer when it is 
awake, with the gumstix using its own Ethernet interface 
(while still impersonating the host PC) when the host is 
asleep. We have built this version where the gumstix 
does not perform Layer-2 bridging, and call it the Wired- 
2NIC prototype. 

Using Wi-Fi: We have also implemented a wireless 
version of Somniloquy. We were unable to implement a 
one-NIC version since the Marvell 88W8385 802.11 b/g 
chipset present on the wifistix does not currently sup- 
port layer 2 bridging. We have however implemented a 
Wireless-2NIC version. 


4.3 Applications Without Stubs


We have implemented a flexible packet filter on the gum- 
stix using the BSD raw socket interface to support appli- 
cations that do not require stubs, e.g. RDP, SSH, telnet 
and SMB connections. Every application in this class 
provides a regular expression matched against incoming 
packets to decide whether to trigger host wakeup. For 
example, handling incoming remote desktop requests re- 
quires the host to be woken up when the gumstix receives 
a TCP packet with destination port 3389. 
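
The port-based trigger in the example above can be sketched as a check on the TCP destination port of a raw IPv4 packet. This is a simplified illustration, not the prototype's filter: the text describes regular expressions matched against packets, whereas this sketch parses the headers directly, and the names `tcp_dest_port`, `should_wake` and `WAKE_PORTS` are our own.

```python
import struct
from typing import Optional

WAKE_PORTS = {3389}  # RDP, as in the example above


def tcp_dest_port(ip_packet: bytes) -> Optional[int]:
    """Return the TCP destination port of an IPv4 packet, or None if the
    packet is not IPv4/TCP or is too short to contain the port fields."""
    if len(ip_packet) < 20:
        return None
    version = ip_packet[0] >> 4
    header_len = (ip_packet[0] & 0x0F) * 4   # IHL is counted in 32-bit words
    protocol = ip_packet[9]                  # 6 means TCP
    if version != 4 or header_len < 20 or protocol != 6:
        return None
    if len(ip_packet) < header_len + 4:
        return None
    # The TCP header starts with source port, then destination port.
    _src_port, dst_port = struct.unpack("!HH", ip_packet[header_len:header_len + 4])
    return dst_port


def should_wake(ip_packet: bytes, wake_ports=WAKE_PORTS) -> bool:
    """True if the packet is a TCP packet addressed to a wake-up port."""
    return tcp_dest_port(ip_packet) in wake_ports
```

A packet that fails any of the header checks simply does not trigger a wake-up, which is the conservative choice for a filter whose false positives cost energy.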

We note that waking up the host computer is not 
enough; the incoming connection request must somehow 
be conveyed to the host. We accomplish this by using 
the iptables firewall on the gumstix to filter any re- 
sponse to TCP or UDP packets that the gumstix does not 
handle itself. Thus trigger packets are not acknowledged 
by the gumstix and the remote client sends retries. Af- 
ter the host has resumed, one of the retries will reach it 
(since it is still using the same IP and MAC addresses) 
and it will respond directly. Using port-based filtering, 
we have implemented wake-up triggers for four appli- 
cations: remote desktop requests (RDP), remote secure 
shell (SSH), file access requests (SMB), and Voice over 
IP calls (SIP/VoIP). 


4.4 Applications Using Stubs 


To demonstrate how modest application stubs can enable 
significant sleep-mode operation in Somniloquy, we have 
also implemented application stubs for three applications 
that were popular in our informal survey: background 
web download, peer to peer content distribution using 
BitTorrent, and instant messaging. For all these applications,
we did not have to modify the operating system
or the existing applications on the PC, which were only
available to us as binaries. To capture the state of the
application for the respective stub, we wrote wrappers 
around the binaries. 

Background Web Downloads: We developed the 
web download stub for wget which works as follows: 
When the host PC transitions to sleep, the status of ac- 
tive downloads is sent to the stub running on the gum- 
stix. The status includes the download URL, the offset 
of how much download has taken place, the buffer space 
available, and the credentials (if required for the download).
Most popular web servers (e.g. IIS and Apache)
allow a download to be resumed from a given byte offset using
HTTP range requests, which they advertise through the
‘Accept-Ranges’ header [22]. The web download
stub then resumes the downloads from the respective off- 
sets of the files, and stores the data on the flash storage 
of the gumstix. If the flash memory fills up before the 
downloads complete, the stub wakes up the host PC and 
transfers the downloaded files from flash storage to the 
host PC, thereby freeing up space. The host PC then goes 
back to sleep while the stub continues the downloads. At 
the end of a download, the gumstix wakes up the host 
PC, and transfers the remaining part of the file. 
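
Resuming a download from a byte offset, as the stub does above, amounts to sending an HTTP `Range` request header. The sketch below uses Python's standard `urllib.request` to build such a request without sending it; the helper name and the example URL are hypothetical, and the prototype's stub for wget is not claimed to work this way internally.

```python
import urllib.request


def build_resume_request(url: str, offset: int) -> urllib.request.Request:
    """Build a GET request asking the server for bytes from `offset` onward.

    A server that honors the range replies with status 206 (Partial Content);
    a plain 200 means it ignored the range and is sending the whole file.
    """
    if offset < 0:
        raise ValueError("offset must be non-negative")
    request = urllib.request.Request(url)
    request.add_header("Range", f"bytes={offset}-")
    return request


# Hypothetical example: resume after the first 1 MiB has been downloaded.
req = build_resume_request("http://example.com/large-file.iso", 1_048_576)
```

Checking for the 206 status before appending to the partial file avoids corrupting it when a server does not support range requests.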

The download stub consumes significantly less energy
to download a file than keeping the PC awake to download
it. The overhead is a slight increase in latency. We
can quantify the savings and overhead using the model
described in Section 3.3. If flash storage is $F$ MB and
the download bandwidth is $B$ MBps, then the host PC is
woken up every $F/B$ seconds, and it is awake for $F/T$
seconds, where $T$ is the transfer rate between the host
and the gumstix. Therefore, using the formula in Section 3.3,
Somniloquy gives most energy savings at low
$B$ and high $T$. We empirically validate this observation
in Section 5.4.4. When $T$ is of the same order as $B$,
Somniloquy might not save much energy. This can happen
if the NIC supports very high rates (e.g. 1 Gbps),
while the secondary processor can only support lower
data rates (up to 100 Mbps) or if the transfer rate $T$ is
limited. However, we anticipate the download stub to be
primarily used in scenarios where the download speeds
are limited by the last mile connection of at most a few
tens of Mbps; here, this stub is nearly always beneficial.
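
To make the dependence on flash size, download bandwidth and transfer rate concrete, the sleep and awake intervals can be plugged into the Section 3.3 ratio. The sketch below is illustrative: the function name and all numeric values are assumptions, not measurements from the prototype.

```python
def download_energy_ratio(flash_mb, bandwidth_mb_s, transfer_mb_s, d, p_a, p_s, p_e):
    """Energy ratio (Somniloquy / always-on host) for the download stub.

    The host sleeps while the flash fills (flash / bandwidth seconds) and is
    awake while the flash is drained to the host (flash / transfer seconds).
    """
    t_sleep = flash_mb / bandwidth_mb_s
    t_awake = flash_mb / transfer_mb_s
    e_somniloquy = (
        t_sleep * p_s
        + (t_awake + d) * p_a
        + (t_awake + d + t_sleep) * p_e
    )
    e_host = p_a * (t_awake + t_sleep)
    return e_somniloquy / e_host


# Hypothetical powers (W) and transition time (s).
common = dict(d=10.0, p_a=100.0, p_s=3.0, p_e=1.0)

baseline = download_energy_ratio(1000.0, 1.0, 10.0, **common)
slower_download = download_energy_ratio(1000.0, 0.5, 10.0, **common)
faster_transfer = download_energy_ratio(1000.0, 1.0, 20.0, **common)
```

Under these assumed values, both a lower download bandwidth and a higher host transfer rate reduce the ratio, matching the qualitative claim in the text.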

BitTorrent: For the BitTorrent stub we customized 
a console-based client, ctorrent, to run on the gumstix 
with a low CPU utilization and memory footprint. Prior 
to suspending to S3, the host computer transfers the ‘.tor- 
rent’ file and the portion of the file that has already been 
downloaded to the gumstix. The BitTorrent stub on the 
gumstix then resumes download of the torrent file and 
stores it temporarily on the SD flash memory of the gum- 
stix. When the download completes, the stub wakes up 
the host and transfers the file. 

When only downloading content, the energy saved by
using this stub is similar to that of the web download
stub, i.e., the frequency of waking up the PC and the duration
for which it is woken up depend on the download bandwidth
$B$, the transfer speed $T$ and the flash size $F$. However,
when uploading/sharing (which is key to altruistic
P2P applications), the energy savings are much greater.
The same file chunk can be uploaded to many peers, and
hence the PC can sleep for much longer, implying more
energy savings using the formula in Section 3.3.

Instant Messaging: For the IM stub, we used a 
console-only IM client called finch that supports many 
IM protocols such as MSN, AOL, ICQ, etc. On the PC, 
we used the corresponding GUI version of the IM client. 
To ensure our goal of a low memory and CPU footprint 
we customized finch to include only the features salient 
to our aim of waking up the host processor when an in- 
coming chat message arrives. This only requires authen- 
tication, presence updates and notifications; we disabled 
other functionality. The host processor transfers over the 
authentication credentials for relevant IM accounts be- 
fore going to S3. The gumstix then logs into the rele- 
vant IM servers, and when an incoming message arrives 
it triggers wakeup. The energy saved by the IM stub is 
thus similar to applications that are handled using packet 
filters (e.g. SSH/RDP), where the duration for which a 
host can sleep depends on the frequency of occurrence 
of wake-up triggers. 


5 System Evaluation 


We present the benefits of Somniloquy in four steps. 
First, we show that gumstix consumes much less power 
than a PC by profiling standalone desktops, laptops and 
the gumstix in different power states. Second, we mea- 
sure the energy saved (and latency introduced) by Som- 
niloquy when used on an “idle” host processor. Third, we 
show how Somniloquy affects the performance of vari- 
ous applications, with and without application stubs. Fi- 
nally, we quantify Somniloquy’s energy savings — mon- 
etary and environmental cost for an enterprise and bat- 
tery lifetime increase for laptops. 

Methodology: To measure the power consumption of
laptops and desktop PCs, we used a commercially available
mains power meter, Watts-Up³. To measure the
power consumption of the standalone gumstix, we built
a USB extension cable with a 100 mΩ 0.1% sense resistor,
which was inserted in series with the +5 V supply
line, and we used this cable to connect the gumstix to the
computer. We calculated the power draw of the gumstix
by measuring the voltage drop across the sense resistor.
All power numbers presented in this section are averaged
across at least five runs.
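
The sense-resistor measurement described above reduces to Ohm's law. A minimal sketch follows, assuming a nominal 5 V supply and subtracting the drop across the resistor; the text does not say whether the authors applied that correction, and the function name and sample reading are hypothetical.

```python
def gumstix_power_w(v_drop, r_sense=0.1, v_supply=5.0):
    """Estimate the power drawn by the device behind a series sense resistor.

    v_drop   -- measured voltage across the sense resistor (V)
    r_sense  -- sense resistance (ohms); 100 milliohms as in the text
    v_supply -- nominal supply voltage (V); assumed, not measured
    """
    current = v_drop / r_sense             # Ohm's law: I = V / R
    return (v_supply - v_drop) * current   # voltage left for the device times I


# Hypothetical reading: a 21 mV drop corresponds to 210 mA.
power = gumstix_power_w(0.021)
```

A small sense resistance keeps the voltage drop, and hence the disturbance to the device under test, negligible compared with the 5 V supply.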


³http://www.wattsupmeters.com/


Condition               | Optiplex 745 | Dimension 4600
Normal idle state       |              |
Lowest CPU frequency    |              |
Disable multiple cores  |              |
‘Base power’            |              |
Time to enter S3        | 94s          | 5.8s

Table 1: Power consumption and S3 suspend/resume
time for two desktops under various operating conditions.
In all cases the processor is idle and the hard disk
is spun down. The power consumed by other peripherals
such as displays is not included.





Condition               | Lenovo X60 | Toshiba M400 | Lenovo T60
Normal idle state       |            |              |
Backlight minimum       |            |              |
Screen turned off       |            |              |
‘Base power’            |            |              |
Suspend state (S3)      | 0.74 W     | 1.15 W       | 0.55 W
Battery capacity        | 65 Wh      | 50 Wh        | 85 Wh
Base lifetime           | 5.9h       | 24           | 4.0h
Suspend lifetime        | 88h        | 43h          | 155h
Time to enter S3        | 8.7s       | 5.5s         | 4.9s
Time to resume from S3  | 3.05s      | 3.65s        | 4.8s

Table 2: Power consumption and battery lifetime of three
laptops under various operating conditions, and the time
to change power states.


5.1 Microbenchmarks — Power, Latency 


Desktops: Table 1 presents the average power consump- 
tion for two Dell desktop machines: an Intel dual core 
(2.4 GHz Core2Duo) OptiPlex 745 with 2GB RAM run- 
ning Windows Vista, and a 2.4GHz Pentium 4 Dimen- 
sion 4600 with 512 MB RAM running Windows XP. The 
display is turned off in these experiments, and only the 
essential system processes are left running. The power 
consumption of the desktop in S3 is two orders of mag- 
nitude less than when it is awake. This is consistent with 
prior published data on the power consumption of modern
PCs [7]. We use the term ‘base power’ to indicate the
lowest power mode that a PC can be in and still be re- 
sponsive to network traffic (without using Somniloquy). 
To get this number, we further scaled down the CPU to 
the lowest permissible frequency on these desktops. Fur- 
thermore, we disabled the multi-core functionality using 
the system BIOS to effectively use only one core and 
verified that the system was actually doing so by using 
a processor ID utility supplied by Intel. The time taken 
for the desktops to resume from S3 and reconnect to the 
network is of the order of a few seconds (Table 1).

Gumstix state                        | Power
Wired version
gumstix only - no Ethernet           | 210 mW
gumstix + Ethernet idle              | 1073 mW
gumstix + Ethernet bridging          | 1131 mW
gumstix + Ethernet + write to flash  | 1675 mW
gumstix + Ethernet broadcast storm   | 1695 mW
gumstix + Ethernet unicast storm     | 1162 mW
Wireless version
gumstix only - no Wi-Fi              | 210 mW
gumstix + Wi-Fi associated (PSM)     | 290 mW
gumstix + Wi-Fi associated (CAM)     | 1300 mW
gumstix + Wi-Fi broadcast storm      | 1350 mW
gumstix + Wi-Fi unicast storm        | 1600 mW

Table 3: Power consumption for the gumstix platform in
various states of operation.


Laptops: Table 2 presents the average power con- 
sumption of three popular laptops: a Lenovo X60 tablet 
PC with 2GB RAM running Windows Vista, a Toshiba 
laptop with 1GB RAM running Windows XP, and a 
Lenovo T60 laptop with 1GB RAM running Windows 
Vista. For all power measurements, the processor is set 
to the lowest speed and is idle, the hard disk is spun down 
and the wireless network interface is powered on. The 
base power is between 11 W and 22 W, resulting in a bat- 
tery lifetime of around 4 to 5 hours with the batteries that 
are present on these laptops. Using the sleep/S3 state 
can dramatically extend the battery lifetime, to between 
40 and 150 hours for the laptops we tested, although the 
laptop is unreachable in this state. 
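
The suspend lifetimes in Table 2 follow from dividing battery capacity by the S3 power draw. The short check below uses the capacity and S3 power values listed in Table 2; the function name is ours and rounding to whole hours is an assumption about how the table was prepared.

```python
def lifetime_hours(capacity_wh, power_w):
    """Battery lifetime in hours for a given capacity (Wh) and power draw (W)."""
    return capacity_wh / power_w


# (battery capacity in Wh, S3 power in W) taken from Table 2.
laptops = {
    "Lenovo X60": (65.0, 0.74),
    "Toshiba M400": (50.0, 1.15),
    "Lenovo T60": (85.0, 0.55),
}

suspend_lifetimes = {
    name: round(lifetime_hours(capacity, power))
    for name, (capacity, power) in laptops.items()
}
```

The same division, applied to the base power instead of the S3 power, gives the much shorter awake lifetimes discussed above.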


Gumstix: Table 3 shows the average power con- 
sumed by the gumstix (with both etherstix and wifistix) 
in various states of operation. The gumstix has a base 
power of approximately 210mW when no network in- 
terface is present (row 1). A gumstix with an active net- 
work interface typically consumes approximately 1070- 
1300 mW (rows 2 and 9), however with an associated 
Wi-Fi interface in power save mode it consumes only 
290 mW (row 8). The power consumption of the gumstix 
when its network interface is active and the downloaded 
data is being written to flash is around 1675 mW (row 
4). Broadcast and unicast ‘storms’ (continuous traffic) 
increase the power consumption by a few hundred milliwatts⁴.
Importantly, the power consumption of the gumstix
is approximately one tenth that of an awake laptop in
the lowest power state, and approximately 50 times less
than an idle desktop.


4Wi-Fi broadcasts are sent at 6 Mbps while unicasts are sent at 
54 Mbps in our setup. Consequently a unicast storm consumes more 
power than a broadcast storm.

Figure 5: Power consumption and state transitions for
our desktop testbed.


5.2 Somniloquy in Operation


We now report the power consumption of Somniloquy in
operation. For these measurements we use two testbed
systems: a desktop (Dell OptiPlex 745 with 2GB RAM
running Windows Vista) with the Wired-1NIC prototype
of Somniloquy, and a laptop (Lenovo X60 tablet PC running
Windows Vista) with the Wireless-2NIC version of
Somniloquy. Thus, our tests span both Ethernet and Wi-Fi
networks, and both the integrated single network interface
version and the higher performance version which uses
the existing internal network interface. The test traffic is
generated using a standard desktop machine running on
the same (wireless or wired) LAN subnet as the testbed
machine.

Figure 5 shows the power consumption of our desktop 
testbed. Initially the desktop’s host processor is awake 
and uses the gumstix for bridging, and the whole sys- 
tem draws 104 W of power. At time ‘A’ a state change 
to S3 is initiated by the user. This request completes at 
time ‘B’ after which the power draw of the system is 
approximately 4.4 W, i.e. 24x less. This power is split 
between the gumstix, the DRAM of the PC, and other 
power chain elements in the PC. Subsequently at time 
‘C’ the gumstix, which has been actively monitoring the 
network interface, wakes up the host in response to a net- 
work event. This request completes at time ‘D’ when the 
host system has fully resumed. As the figure illustrates,
this resume event takes about 4 seconds. We do not show
the laptop figure for space reasons; the trace looks very 
similar with a starting power of 16 W with the screen on 
(which drops to 11 W if the screen is turned off), a power 
draw of 1 W when using Somniloquy (11x less than the 
screen-off case) and a resume time of 3 seconds. 


5.3 Application Performance


As described earlier there are two classes of applications
that are supported by Somniloquy: first, a large class of
applications that do not require application stubs, and
second a smaller class of applications that can be supported
using application stubs running on the gumstix.
We performed a number of experiments to evaluate the
performance of both these classes of applications.

Figure 6: Application-layer latency for three Somniloquy
testbeds and four application types.


5.3.1 Applications without stubs 


We now quantify the end-to-end latency (as perceived by 
users) incurred by the applications that are handled by 
Somniloquy without using application stubs. For these 
experiments, we use the same two testbeds as above, with
the addition of a third testbed based on the Wired-2NIC
prototype (using the same desktop machine as the
Wired-1NIC case), providing a direct comparison between the
1NIC and 2NIC cases. In each case the latency reported
is the mean over five test runs.

Figure 6 reports the time taken to satisfy an incoming 
application-layer request for four sample applications. 
For each application, we show the latency for “awake”
operation (i.e. when the host is on and directly responds
to the request) and for the case where the host is in S3
and the Somniloquy prototype receives the incoming packet
and triggers wake-up of the host.

The four applications we tested were: 

Remote desktop access (RDP): Here we used a stop- 
watch to measure the latency between initiating a remote 
desktop session to the host and the remote desktop be- 
ing displayed. A stopwatch was used to ensure that true 
user-perceived latency was measured. The gumstix was
configured to wake up the main processor on detecting
TCP traffic on port 3389 (the RDP port).

Remote directory listing (SMB): A directory listing 
from the Somniloquy testbed was requested by the tester 
machine (via Windows file sharing, which is based on the 
SMB protocol). The time between the request being ini- 
tiated and the listing being returned was measured using 
a simple script. The secondary processor was configured 
to initiate wake-up on detection of traffic on either of the
TCP ports used by SMB, i.e. ports 137 and 445.

Remote file copy (SMB): The SMB protocol was 
used again, but this time to transfer a 17 MB file from 
the Somniloquy testbed to the tester machine. 

VoIP call (SIP): A Voice-over-IP call was placed to
a user who had been running a SIP client on the
Somniloquy laptop before it had entered S3. On receipt of
the incoming call the SIP server responded with a TCP
connection to the testbed, causing the gumstix to
trigger wake-up. A similar procedure was used in [2]. Once
again, the latencies were measured using a stopwatch to
measure true user-perceived delay.
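The wake-up triggers used in these tests amount to a small port filter running on the secondary processor. A minimal sketch of such a filter follows. The names WAKE_PORTS and should_wake are hypothetical, and this is not the gumstix software itself. The SIP case is left out because the text does not state which port the SIP server's TCP connection used.

```python
# Hypothetical sketch of a port-based wake-up filter.
# The port numbers are the ones named in the text above.
WAKE_PORTS = {
    3389: "RDP",  # remote desktop access
    137: "SMB",   # Windows file sharing, as listed in the text
    445: "SMB",
}

def should_wake(protocol: str, dst_port: int) -> bool:
    """Return True if an incoming packet should trigger host wake-up."""
    return protocol.lower() == "tcp" and dst_port in WAKE_PORTS

print(should_wake("tcp", 3389))  # RDP connection attempt
print(should_wake("tcp", 80))    # not a configured trigger
```

Keeping the rule set in a plain dictionary makes it simple to add or remove triggers per application without touching the matching logic.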

As Figure 6 shows, Somniloquy adds between 4 and 10 s of
latency in all cases. As described in Section 5.2,
part of this latency is attributed to resuming from S3, i.e.
4-5 s for the desktop and 2-3 s for the laptop, and is
independent of Somniloquy. Further latency is due to the
delay for TCP to retransmit the request, and for the host
to respond to the request (which may take longer since
it has just resumed). Note that the Wired-1NIC prototype
shows higher latency than the Wired-2NIC prototype.
This is purely an artifact of our prototype, caused
by the overhead of MAC bridging and, to a larger extent,
by the slower speed of the USBNet IP link between the
gumstix and the host. The latter is particularly obvious in
the file copy test, where the file copy with the Wired-2NIC
case is much faster than with Wired-1NIC (although the
Wired-1NIC speed is still faster than Wireless-2NIC). While
Somniloquy does result in 4-10 s of additional
application-layer latency, these delays are acceptable for
real usage (including VoIP [2]) in exchange for the
substantial benefit of 20x-50x power savings.


5.3.2 Applications Requiring Stubs 


In this section we present evaluations for applications 
that require stub support on the gumstix, primarily look- 
ing at the overhead in terms of memory consumption 
and processing capabilities that they impose on the gum- 
stix. We have implemented application stubs for three 
common applications — background downloads using 
the http protocol, P2P file sharing using BitTorrent, and 
maintaining presence on IM networks — as described in 
Section 4. 

To study the overhead of IM clients, we run the
corresponding application stub using up to three different
IM protocols simultaneously: MSN Messenger, AOL
Messenger and ICQ Chat. Table 4 shows the processor
utilization and memory footprint of the Wired-1NIC
prototype when running these IM clients. Since the IM stub
maintains the presence of the user on various networks and
wakes up the host on receipt of an appropriate trigger
(an IM from someone), the
latency values are similar to those of the VoIP application

NSDI ’09: 6th USENIX Symposium on Networked Systems Design and Implementation 




Accounts        Processor            Memory
                (95th percentile)    (95th percentile)
None
MSN only
MSN+AOL
MSN+AOL+ICQ

Table 4: Processor and memory utilization for the IM
stub for various configurations. Total memory for the
gumstix is 64 MB.


Configuration                              Processor            Memory
                                           (95th percentile)    (95th percentile)
Single download
  4 MB cache
  8 MB cache
  16 MB cache
Two simultaneous downloads (4 MB cache)
  1st download                             16%                  6.5 MB
  2nd download                             24%                  7.0 MB

Table 5: Processor and memory utilization for the
BitTorrent stub for various configurations. Total memory for
the gumstix is 64 MB.


as reported in Figure 6. For our Wired-1NIC prototype 
the additional latency for the IM stub when using Som- 
niloquy is around seven seconds. 

To evaluate the overhead of P2P file sharing using the
BitTorrent stub on the gumstix, we initiated downloads
using a torrent from a remote website⁵ onto the 2 GB SD
card of the Wired-1NIC gumstix. We varied the memory
cache available to the stub while conducting a single
download, and then tested two simultaneous downloads.
The results in Table 5 show that the memory footprint of 
the stub increases proportionally to the cache size as ex- 
pected, while the processor utilization remains constant. 
When there are two simultaneous downloads, each in- 
stance of the stub uses memory proportional to its speci- 
fied 4 MB cache. 

Finally, to evaluate the web-download stub on the
gumstix, we initiated the download of a large (300 MB) file
from a local web server. We varied the throughput of
the downloads and measured the processor utilization 
and the memory consumption of the gumstix, and exper- 
imented with two simultaneous downloads. As shown in 
Table 6, the processor utilization increases as the down- 
load rate increases although the memory footprint for 
each download remains constant. 

The above results show that, using application stubs,
we can support fairly complex tasks and applications,
including background web downloads and P2P file sharing,
using relatively modest resources on the gumstix. It
is important to note that the power consumption of the
gumstix did not exceed 2 W in any of these experiments.

⁵ http://www.legaltorrents.com/

Configuration                              Processor            Memory
                                           (95th percentile)    (95th percentile)
Single download
                                           9.2%
                                           21%
                                           50%
Two simultaneous downloads (4 Mbps each)
  1st download                             31%                  1.8 MB
  2nd download                             26.3%                1.8 MB

Table 6: Processor and memory utilization for the web
download stub for various configurations. Total memory
for the gumstix is 64 MB.


5.4 Energy Savings using Somniloquy 


In addition to evaluating the operating performance of 
our Somniloquy prototypes, it’s also important to assess 
the higher level goal of this work, namely the impact on 
PC energy consumption. In this section we present some 
data which demonstrates the potential of Somniloquy to 
reduce both desktop and laptop energy usage in general 
terms. We also verify the energy saving model presented 
in Section 3.3, which allows the specific savings in a 
given application scenario to be calculated. Unless
otherwise noted, we are using the Wired-1NIC version of our
prototype for the desktop energy measurements and the
Wireless-2NIC version for the laptop energy measurements.


5.4.1 Reducing Desktop Energy Consumption 


Figure 7: Power consumption and the resulting estimated
battery lifetime of a Lenovo X60 using Somniloquy. The
lifetime is calculated using the standard 65 Watt-hour
battery of the laptop.

Our testbed desktop PC consumes 102 W in normal
operation and <5 W in S3 with Somniloquy. Somniloquy
therefore saves around 97 W. On this basis, if Somniloquy
were to be deployed in an environment where a PC
is actively used for an average of 45 hours each week
(i.e. 27% of the time), this would result in 620 kWh
of savings per computer in a year. Assuming 0.61 kg
CO2/kWh⁶ and US$0.09/kWh⁷, this means an annual
saving of 378 kg of CO2 (to put it in perspective, the
average US resident’s annual CO2 emissions are 20 metric
tonnes, as compared to a worldwide average of 4 metric
tonnes per person⁸) and US$56 per computer. We
believe this is significantly higher than the
bill-of-materials cost of the components required to
implement a commoditized Somniloquy-enabled network card.
In this case, deployments of Somniloquy-enabled devices
would pay for themselves within a year.

⁶ http://www.eia.doe.gov/cneaf/electricity/page/co2_report/co2report.html

⁷ http://www.eia.doe.gov/cneaf/electricity/epa/epa_sum.html

⁸ http://www.sciencedaily.com/releases/2008/04/080428120658.htm
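The desktop estimate in Section 5.4.1 can be reproduced with a few lines of arithmetic. The sketch below is ours, not part of the original evaluation. It assumes a 52-week year and that the PC sleeps whenever it is not actively used. The default emission factor and electricity price are the ones quoted in the text.

```python
HOURS_PER_WEEK = 7 * 24
WEEKS_PER_YEAR = 52  # assumption: a 52-week year

def annual_savings(saved_watts, active_hours_per_week,
                   kg_co2_per_kwh=0.61, usd_per_kwh=0.09):
    """Return (kWh, kg CO2, US$) saved per computer per year."""
    idle_hours = (HOURS_PER_WEEK - active_hours_per_week) * WEEKS_PER_YEAR
    kwh = saved_watts * idle_hours / 1000.0
    return kwh, kwh * kg_co2_per_kwh, kwh * usd_per_kwh

kwh, co2_kg, usd = annual_savings(saved_watts=97, active_hours_per_week=45)
print(f"{kwh:.0f} kWh, {co2_kg:.0f} kg CO2, US${usd:.0f}")
```

Under these assumptions the result rounds to the 620 kWh, 378 kg of CO2 and US$56 reported above.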


5.4.2 Desktop Energy Savings for Real Workloads 


We now estimate the energy savings enabled by Somniloquy
under realistic workloads. We use the data provided
by [20], relating to the use patterns of twenty-two distinct
desktop PCs, whose state is classified as idle, active,
asleep or turned off. We then compute the
energy consumed by each of the PCs with and without
Somniloquy using the formula of Section 3.3. For ease of
exposition, we bin the data into three categories:
PCs that are idle for <25% of the time (7 machines), idle
for 25%-75% of the time (6 machines) and finally those
that are idle for >75% of the time (9 machines). The
average energy savings for these twenty-two PCs when
using Somniloquy is 65%, as compared to normal
operation without Somniloquy. The average energy savings
for the PCs in the individual categories are 38%, 68%
and 85% respectively. As expected, the largest energy
savings are for the PCs with larger idle times since they
have more opportunity to use Somniloquy.
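The overall 65% figure is consistent with the per-category averages once they are weighted by the number of machines in each bin. A small check, written by us and using only the numbers quoted above (the per-bin averages are rounded, so the result is approximate):

```python
# (number of machines, average energy savings in %) per idle-time bin
bins = [
    (7, 38.0),  # idle < 25% of the time
    (6, 68.0),  # idle 25%-75% of the time
    (9, 85.0),  # idle > 75% of the time
]

def weighted_mean(groups):
    """Average of per-group means, weighted by group size."""
    total = sum(count for count, _ in groups)
    return sum(count * mean for count, mean in groups) / total

overall = weighted_mean(bins)
print(f"{overall:.1f}% across {sum(c for c, _ in bins)} PCs")
```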


5.4.3 Increasing Laptop Battery Lifetime


Figure 8: Comparing the analytical results with the
measured values for the web-download stub. The flash
storage available on the gumstix is set to 100 MB, unless
stated otherwise. [Legend: % energy savings (analytical,
measured); % latency increase (analytical, measured).]

Figure 7 shows the average power consumption of the
laptop testbed when operating normally (i.e. no power
saving mechanisms), with standard power saving
mechanisms in place (the baseline power), when Somniloquy
(Wireless-2NIC) is operational, and in the standard S3
mode (without the gumstix attached). Somniloquy adds
a relatively low overhead of 300 mW to S3 mode,
resulting in a total power consumption which is close to just
1 W, as compared to the 11 W of the idle laptop. This
means that when the laptop needs to be attached to the
network and available for remote applications but is
otherwise idle, it can be put into Somniloquy mode to enable
an order of magnitude decrease in power consumption
and a resulting increase in battery lifetime from 5.9 hours
to 63 hours (using the standard 65 Watt-hour battery).

5.4.4 Energy Savings for Specific Applications 


The basic analysis of energy consumption and battery 
lifetime presented above is very generic; for a given us- 
age scenario it should be possible to use the energy sav- 
ing model presented in Section 3.3 to predict savings 
much more accurately. In order to validate this model 
we ran experiments downloading content from a remote 
web server, and measured both energy consumption and 
latency so as to compare them with their corresponding 
analytical values. Note that we only measure the energy 
consumption for the duration of the application. 

The web download stub was chosen since it was
relatively easy to change the duty cycle of the host, i.e. the
duration for which the host can sleep (T_sleep) after which
it needs to be woken up to transfer data from the gumstix
(T_awake). As discussed in Section 3.3, T_sleep depends on
the download bandwidth and the amount of flash storage
on the gumstix, while T_awake depends on the amount of
flash storage on the gumstix and the transfer rate between
the gumstix and the host. We downloaded a 300 MB
file at various link bandwidths ranging from 512 Kbps
to 2 Mbps, and used two different flash storage sizes at
the gumstix (100 MB and 200 MB), effectively varying
T_sleep from approximately 1600 seconds down to 400
seconds. We measured the power consumed during the
download using the methodology described at the
beginning of this section. In Figure 8, we present the measured
energy savings and the corresponding values predicted
using our model for four different data points. As we
can see from the figure, the predicted energy savings and
the increased latency closely match the measured values
(within 1.5%). The values do not match exactly since
the actual measured power values vary over time, and
the time taken to suspend and resume also varies across
runs, whereas we used fixed values for these in the formula.

Figure 8 also illustrates that increasing the bandwidth
from 512 Kbps to 2 Mbps reduces the energy savings
from 85% to 50%, and increases the latency from 11% to
43%, although a larger amount of flash storage improves
the energy saving and latency. As explained earlier, this
is due to the limited transfer speed of the USBnet
interface in our prototype (<5 Mbps), because of which the
PC is awake for longer periods of time while
transferring the data from the gumstix (T_awake = 181 seconds to
transfer 100 MB of data). In Figure 8 we have also
plotted an ideal case (1 Mbps-ideal) where the host can read
the flash storage of the gumstix directly. For the ideal
case the duration for which the host needs to stay awake
to transfer data from the gumstix reduces considerably
(T_awake = 23 seconds). This improves energy savings to
91% and limits the increase in latency when using
Somniloquy to less than 5%.
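The duty-cycle behaviour described in this section can be approximated with a simplified model. The sketch below is our own and is not the formula of Section 3.3: it ignores suspend and resume transitions, treats 1 MB as 8 Mbit, and assumes the desktop figures of 102 W awake and 4.4 W asleep apply. All function and parameter names are hypothetical.

```python
def sleep_seconds(flash_mb: float, download_mbps: float) -> float:
    """T_sleep: time to fill the gumstix flash at the download rate."""
    return flash_mb * 8 / download_mbps

def duty_cycle(flash_mb, download_mbps, t_awake_s,
               p_awake_w=102.0, p_sleep_w=4.4):
    """Return (T_sleep, fractional energy savings, fractional latency increase)."""
    t_sleep = sleep_seconds(flash_mb, download_mbps)
    # Baseline: the host stays awake for the whole download.
    baseline_j = p_awake_w * t_sleep
    # Somniloquy: host sleeps while the gumstix downloads, then wakes to copy.
    somniloquy_j = p_sleep_w * t_sleep + p_awake_w * t_awake_s
    savings = 1 - somniloquy_j / baseline_j
    latency_increase = t_awake_s / t_sleep
    return t_sleep, savings, latency_increase

for mbps in (0.512, 2.0):
    t, s, l = duty_cycle(flash_mb=100, download_mbps=mbps, t_awake_s=181)
    print(f"{mbps} Mbps: T_sleep={t:.0f} s, savings={s:.0%}, latency=+{l:.0%}")
```

Under these assumptions the model gives T_sleep values close to the 1600 s and 400 s quoted above, and savings and latency figures within a few percentage points of the reported ones. The remaining gap is expected, since transition costs are left out.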


6 Related Work 


There have been several proposals to reduce the en- 
ergy consumption of desktop PCs and laptops. Prior 
work can largely be grouped in three categories: re- 
ducing the active power consumption of devices (when 
awake) [3, 5, 9, 10, 16, 17], reducing the power con- 
sumption of the network infrastructure (e.g. routers and 
switches) [11, 12, 21], and opportunistically putting the 
devices to sleep. Somniloquy falls in the third category. 
Since a machine in a sleep state consumes significantly
less power than in its lowest-power active state [11, 27]
(verified by us in Section 5), significant energy savings are
possible by putting the machine to sleep whenever
possible.

For opportunistic-sleep systems, the biggest challenge 
is to ensure connectivity when the host is asleep. Prior 
techniques to solve this problem either use advanced 
functionality in the NIC [18] or use extra network in- 
terfaces [26, 27]. We now compare and contrast Somnil- 
oquy to both these classes of work. 

Among schemes that do not use an extra net- 
work interface, the most well-known are Wake-on-LAN 
(WoL) [18] and its wireless equivalent, Wake-on-WLAN 
(WoWLAN). In both these schemes, the NIC parses
incoming packets when the host is asleep. It wakes up
the host PC whenever an incoming “magic” packet is
received. According to the specification [18], the magic
packet payload must include a 6-byte synchronization
pattern (six bytes of 0xFF), followed by 16 copies
of the NIC’s MAC address. In WoWLAN, the only
difference is that this packet is sent over the Wireless LAN.
Although most modern NICs implement WoL
functionality, few deployed systems actually use it,
for four main reasons. First, the remote host
must know that the PC is asleep and that it must wake 
it up before pursuing application functionality. Second, 
the remote host must have a way of sending a packet to 
the sleeping PC through any firewalls/NAT boxes, which 
typically do not allow incoming connections without spe- 
cial configuration. Third, the remote host must know 
the MAC address of the sleeping PC. Fourth, WoWLAN 
does not work when laptops change their subnet because 
of mobility. In contrast, Somniloquy does not require the 
extra configuration of firewalls/NAT boxes, and is trans- 
parent to remote application servers. It can handle mo- 
bility across subnets since the secondary processor can 
re-associate with services such as Dynamic DNS (to redi- 
rect a permanent host name to the PC’s new IP address), 
and re-log-in to servers such as IM servers. In addition 
to these differences, Somniloquy also allows applications 
to be offloaded to the low power processor. There is no 
such concept in WoL, which instead wakes up the host 
when any pattern is matched. 
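To make the magic packet format concrete, here is a minimal sketch that builds the payload, assuming the widely documented layout of six 0xFF bytes followed by sixteen repetitions of the target MAC address. The function name and the example MAC address are ours. Sending the packet (typically as a broadcast frame or UDP datagram) is deliberately left out.

```python
def build_magic_packet(mac: str) -> bytes:
    """Build a Wake-on-LAN magic packet payload for the given MAC address."""
    hex_digits = mac.replace(":", "").replace("-", "")
    if len(hex_digits) != 12:
        raise ValueError("MAC address must contain 12 hex digits")
    mac_bytes = bytes.fromhex(hex_digits)  # raises ValueError on non-hex input
    # Synchronization stream followed by 16 copies of the MAC address.
    return b"\xff" * 6 + mac_bytes * 16

packet = build_magic_packet("00:11:22:33:44:55")
print(len(packet))  # 6 + 16 * 6 = 102 bytes
```

Note that the sender needs the target's MAC address up front, which is the third deployment obstacle listed above.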


Intel recently announced its Remote-Wake [14]
chipset technology (RWT) that claims to extend WoL on
new motherboards by allowing VoIP calls to wake up a 
system, although its general applicability to other appli- 
cations is not known. The details of this technology are 
not published. In contrast, Somniloquy goes beyond just 
WoL or RWT. It allows low power operation for various 
applications other than VoIP. Furthermore, Somniloquy 
does not require modifications to application end points 
or servers. RWT requires applications to first contact a 
server, which then sends a special packet to the PC to 
signal a wake up. 


Another approach is to use additional “low-power” 
network interfaces to maintain connectivity to the PC that 
is asleep. This approach has been proposed for use with 
mobile devices. For example, Wake-on-Wireless [26] 
wakes up the host PC on receiving a special packet on 
the low power network interface. Turducken [27] uses 
several tiers of network interfaces and processors with 
different power characteristics, and wakes up the upper 
tier when the lower tier cannot handle a task. In con- 
trast to these schemes, Somniloquy requires only a single 
network interface, and presents the paradigm of a single 
PC to users rather than a multi-tiered system, preserv- 
ing the current user experience and therefore requiring 
less training to use. Somniloquy also gives the
impression to remote application servers that a device remains
awake all the time even though it is actually asleep, since
the same MAC and IP addresses are used. This level of
transparency is not provided by either Wake-on-Wireless
or Turducken. Finally, we have gone into more detail
than previous work on ways of supporting applications
that require interactions between the secondary and the
host processor to perform offload, such as IM,
BitTorrent and web downloads.

To reduce the power consumed by desktop PCs, some 
early proposals have suggested the use of proxies on the 
subnet that function on behalf of the desktop PC when it 
is asleep [4, 7, 11]. The proxy monitors incoming pack- 
ets for the PC, and wakes it up using WoL when the PC 
needs to handle the packet. We are not aware of any
published prototype implementations of such systems.
Recently, Sabhanatarajan et al. [25] proposed a smart NIC
that can act as a proxy for a host to save power.
However, the authors focus primarily on the design of a
high-speed packet classifier for such an interface. In
comparison, Somniloquy has much wider applicability than the
above schemes. It can be used in homes and small offices 
where it might be infeasible to deploy a dedicated server 
to handle processing for another PC. 

A contemporaneous effort to Somniloquy is the idea
of a Network Connection Proxy (NCP) [15, 20], which
is a network entity that maintains the presence of a
sleeping PC. In [15], the authors define the requirements of
an NCP and propose modifications to the socket layer 
(similar to Split TCP) for keeping TCP connections alive 
through a PC’s sleep transitions. In [20], the authors ex- 
tend these APIs to support other protocols as well. Som- 
niloquy is similar in spirit to NCP, and NCP’s socket 
APIs can reduce Somniloquy’s overhead when waking 
up from sleep (Section 3.1). Furthermore, to the best of 
our knowledge, Somniloquy is the first published proto- 
type of any proxying system. 

We note that the concept of adding more process- 
ing to the network interface is not new. Existing prod- 
ucts offload processing to the NIC to improve perfor- 
mance (TCP offload [19]) and remote manageability (In- 
tel AMT [13]). Somniloquy uses a similar offloading 
paradigm, but to conserve energy instead of improving 
performance or manageability. 


7 Conclusions 


We have presented Somniloquy, a system that augments 
network interfaces to allow PCs to be put into low-power 
sleep states opportunistically, without sacrificing func- 
tionality. Somniloquy enables several new energy sav- 
ing opportunities. First, PCs can be put to sleep while 
maintaining network reachability, without special net- 
work infrastructure as needed by previous solutions (e.g. 
WoL). Second, some applications can be run in sleep
mode, thereby requiring much less power. In this paper,
we have shown the feasibility of running three such
applications in sleep mode: BitTorrent, instant messaging,
and web downloads.

Somniloquy achieves these energy savings without
requiring any modifications to the network, to remote
application servers, or to the user experience of the PC.
Furthermore, Somniloquy can be incrementally deployed on
legacy network interfaces, and does not rely on changes
to the CPU scheduler or the memory manager to
implement this functionality; it is thus compatible with a wide
class of machines and operating systems.

Our prototype implementation, based on a USB pe- 
ripheral, includes support for waking up the PC on net- 
work events such as incoming file copy requests, VoIP 
calls, instant messages and remote desktop connections, 
and we have also demonstrated that file sharing/content 
distribution systems (e.g. BitTorrent, web downloads) 
can run in the augmented network interface, allowing for 
file downloads to progress without the PC being awake. 
Our tests show power savings of 24x are possible for 
desktop PCs left on when idle, or 11x for laptops. For 
PCs that are left idle most of the time, this translates to 
energy savings of 60% to 80%. The electricity savings 
made are such that deploying a productized version of 
Somniloquy could pay for itself within a year. 
Skilled in the Art of Being Idle:
Reducing Energy Waste in Networked Systems

Abstract


Networked end-systems such as desktops and set-top 
boxes are often left powered-on, but idle, leading to 
wasted energy consumption. An alternative would be for 
these idle systems to enter low-power sleep modes.
Unfortunately, today, a sleeping system sees degraded
functionality: first, a sleeping device loses its network
“presence”, which is problematic to users and applications that
expect to maintain access to a remote machine and,
second, sleeping can prevent running tasks scheduled
during times of low utilization (e.g., network backups).
Various solutions to these problems have been proposed over
the years, including Wake-on-LAN (WoL) mechanisms that
wake hosts when specific packets arrive, and the use of a
proxy that handles idle-time traffic on behalf of a sleep- 
ing host. As of yet, however, an in-depth evaluation of 
the potential for energy savings, and the effectiveness of 
proposed solutions has not been carried out. To remedy 
this, in this paper, we collect data directly from 250 en- 
terprise users on their end-host machines capturing net- 
work traffic patterns and user presence indicators. With 
this data, we answer several questions: what is the po- 
tential value of proxying or using magic packets? which 
protocols and applications require proxying? how com- 
prehensive does proxying need to be for energy benefits 
to be compelling? and so on. 


We find that, although there is indeed much potential 
for energy savings, trivial approaches are not effective. 
We also find that achieving substantial savings requires a 
careful consideration of the tradeoffs between the proxy 
complexity and the idle-time functionality available to 
users, and that these tradeoffs vary with user environ- 
ment. Based on our findings, we propose and evaluate 
a proxy architecture that exposes a minimal set of APIs 
to support different forms of idle-time behavior. 
1 Introduction


Recent years have seen rising concern over the energy 
consumption of our computing infrastructure. A recent 
study [19] estimates that, in the U.S. alone, energy con- 
sumption for networked systems approaches 150 TWh, 
with an associated cost of around 15 billion dollars. 
About 75% of this consumption can be attributed to 
homes and enterprises, and the remaining 25% to net- 
works and data centers. Our focus in this paper is on re- 
ducing the 75% consumed in homes and enterprises. To 
put this in perspective, this energy (112 TWh) is roughly 
equivalent to the yearly output of 6 nuclear plants [14]. 
Of equal concern is that this consumption has grown — 
and continues to grow — at a rapid pace. 

In response to these energy concerns, computer ven- 
dors have developed sophisticated power management 
techniques that offer various options by which to reduce 
computer power consumption. Broadly, these techniques 
all build on hardware support for sleep (S-states), and
frequency/voltage scaling [21] (processor P-states [4]). 
The former is intended to reduce power consumption 
during idle times, by powering down sub-components 
to different extents, while the latter reduces power con- 
sumption while active, by lowering processor operating 
frequency and voltage during active periods of low sys- 
tem utilization. 

Of these, sleep modes offer the greatest reduction in 
the power draw of machines that are idle. For example, a
typical sleeping desktop draws no more than 5 W [2], as
compared to at least 50 W [2] when on but idle: an order
of magnitude reduction. It is thus unfortunate that sleep
modes are not taken advantage of to anywhere close to 
their fullest potential. Surveys of office buildings have 
shown that about two thirds of desktops are fully on at 
night [20], with only 4% asleep. Our own measurements 
(Section 3) reveal that enterprise desktops remain idle for 
an average of 12 hours/day — time that could, in theory, 
be spent mostly sleeping. 
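To give a rough sense of scale, the figures above can be combined into a per-desktop estimate. This back-of-the-envelope sketch is ours and is only illustrative: it treats the quoted bounds (5 W asleep, 50 W idle) as exact values and assumes the machine sleeps for the entire 12 idle hours every day of the year.

```python
def annual_idle_savings_kwh(idle_watts: float, sleep_watts: float,
                            idle_hours_per_day: float,
                            days_per_year: int = 365) -> float:
    """Energy saved per year if all idle time were spent asleep."""
    saved_watts = idle_watts - sleep_watts
    return saved_watts * idle_hours_per_day * days_per_year / 1000.0

savings_kwh = annual_idle_savings_kwh(idle_watts=50, sleep_watts=5,
                                      idle_hours_per_day=12)
print(f"{savings_kwh:.1f} kWh per desktop per year")
```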

Relative to an idle machine, a sleeping machine loses
functionality in two ways. First, since a sleeping
computer cannot receive or transmit network messages,
it effectively loses its “presence” on the network.
This can lead to broken connections and sessions when
the machine resumes (e.g., a sleeping machine does not
renew its DHCP lease and hence loses its IP address)
and also prevents remote access to a sleeping computer.
This loss of functionality is problematic in an
increasingly networked world. For example, a user at home
might want to access files on his desktop at work, an
on-the-road user might want to download files from his
home machine to his handheld, and system administrators
might desire access to enterprise machines for software
updates, security checks and so forth. In fact, some
enterprises require that users not power off their
desktops to ensure administrators can access machines at all
times [6]. The second problematic scenario is when users
or administrators deliberately want to schedule tasks to
run during idle times, e.g., network backups that run
at night, critical software updates, and so on.
Unfortunately, these drawbacks cause users to forgo the use of
sleep modes, leading to wasteful energy consumption.

The above observations are not new, having been re- 
peatedly articulated (also by some of the authors) in both 
the technical literature and popular press [13, 16, 19, 10, 
7, 15]. Likewise, there have been two long-standing
proposals on how to tackle the problem. The first is to
generalize the old technology of Wake-on-LAN (WoL), an
Ethernet computer networking standard that allows a
machine to be turned on or woken up remotely by a special
“magic packet”. A second, more heavyweight, proposal
has been to use a proxy that handles idle-time traffic on
behalf of one or more sleeping hosts, waking a sleeping
host when appropriate. Thus both the problem (wasted energy
consumption by idle computers) and the proposed solutions
(wake-up packets and/or proxies for sleeping machines)
have existed for a while now. In fact, the technology for
WoL has been implemented and deployed although not 
widely used (we explore possible causes for this later 
in the paper). However the recent focus on energy con- 
sumption has led to renewed interest in the topic with 
calls for research [7, 13], calls for standardization [12], 
and even some commercial prototypes [15]. As yet how- 
ever, there has been little systematic and in-depth evalua- 
tion of the problem or its solutions — what savings might 
such solutions enable? what is the broader design space 
for solutions? what, if any, might be the role of standard- 
ization? are these the right long-term solutions? etc. 

In this paper, we explore these questions by studying 
user behavior and network traffic in an enterprise envi- 
ronment. Specifically, we focus on answering the follow- 
ing questions: 


Q1: Is the problem worth solving? Just how much
energy is squandered due to poor computer sleeping
habits? This will tell us the potential energy savings these
solutions stand to enable and hence the complexity they
warrant. Also, is proxying really needed to realize these
potential savings or can we hope that WoL suffices to
maintain network presence while still sleeping usefully?


Q2: What network traffic do idle machines see? Un- 
derstanding this will shed light on how this idle-time traf- 
fic might be dealt with and, consequently, what protocols 
and applications might trigger wake-up packets and/or 
require proxying. On the face of it, it would seem like 
an idle machine ought not to be engaged in much useful 
activity and hence, ideally, one might hope that a small 
number of wake-up events are required and/or that a rel- 
atively small set of protocols must be proxied to realize 
useful savings. 


Q3: What is the design space for a proxy? In general, 
the space appears large. Different proxy implementations 
might vary in the complexity they undertake in terms of 
what work is handled by the proxy vs. waking the ma- 
chine to do so. In some cases, one might opt for a rela- 
tively simple proxy that (for example) only responds to 
certain protocols such as ARP (specified by the DMTF 
ASF2.0 standard[1]) and NetBios. But more complex 
proxies are also conceivable. For example, a proxy might 
take on application-specific processing such as initiat- 
ing/completing BitTorrent downloads during idle times 
and so forth. Likewise, there are many conceivable de- 
ployment options — a proxy might run at a network mid- 
dlebox (e.g., firewall, NAT, efc.), at a separate machine 
on each subnet, or even at individual machines (e.g., on 
its NIC, on a motherboard controller, or on a USB- 
attached lightweight microengine). Given this breadth 
of options, we are interested in whether one can iden- 
tify a minimal proxy architecture that exposes a set of 
open APIs that would accommodate a spectrum of design 
choices and deployment models. Doing so appears important
because a proxy potentially interacts with a diversity
of system components and even vendors (hardware
power management, operating systems, higher-layer applications,
network switches, NICs, etc.) and hence identifying
a core set of open APIs would allow different vendors
to co-exist and yet innovate independently. For example,
an application developer should be able to define
the manner in which his application interacts with the
proxy with no concern for whether the proxy is deployed
at a firewall, a separate machine or a NIC.


Q4: What implications does proxying have for future 
protocol and system design? The need for a proxy 
arises largely because network protocols and applica- 
tions were never designed with energy efficiency in mind 
nor to usefully exploit, or even co-exist with, power man- 
agement in modern PCs and operating systems. While 
proxies offer a pragmatic approach to dealing with this 
mismatch for currently deployed protocols and software, 
one might also take a longer-term view of the problem 
and ask how we might redesign protocols, applications 
or even hardware power management to eventually obviate
the need for such proxying altogether.

In this paper, we study the network-related behavior of 
250 users and machines in enterprise and home environ- 
ments, and evaluate each of the above questions in Sec- 
tions 3 to 6 respectively. 


2 Measurement data and methodology 


We collected network and user-level activity traces from 
approximately 250 client machines belonging to Intel 
Corporation employees, for a period of approximately
5 weeks. The machines, running Windows XP, include 
both desktops and notebooks—approximately 10% are 
desktops and the rest, notebooks. 

Our trace collection software was run at the individ- 
ual end-hosts themselves and hence, in the case of note- 
books, trace collection continued uninterrupted as the 
user moved between enterprise and home, enabling us 
to analyze traffic from both of these environments. 

Our packet traces were collected using Windump. To 
capture user activity, we developed an application that 
sampled a number of user activity indicators at one sec- 
ond intervals. The user activity indicators we collected 
included keyboard activity and mouse movements and 
clicks. Noticeable gaps in the traces occur when the host 
was turned off, put to sleep, or in hibernation. Thus each 
end-host is associated with a trace of its network and user 
activity. We then used BRO [9] to reassemble connection- 
level information from each packet-level trace. 

Thus, for the 5 week duration of our measurement 
study, we have the following information for each end- 
host: 

• a packet-level (pcap) trace capturing packet headers
for the entire duration

• per-second indicators of user presence at the machine

• the set of all connections (incoming and outgoing)
as reconstructed by BRO from the packet traces

The result is a 500GB repository of trace data. To process
this, we developed a custom tool that extends the
publicly available WIRESHARK [3] network protocol an- 
alyzer with different function callbacks implementing 
the additional processing required for our study. 


3 Low Power Proxying: Potential and Need 


In this section, we estimate the energy wasted by home 
and office computers that remain powered on even when 
idle, i.e., even when there is no human interacting with 
the computer. Subsequently, we investigate whether very 
simple approaches — e.g., the computer is woken up to 
process every network packet and then returns to sleep 
immediately after—would suffice in allowing hosts to 
sleep more while preserving their network “presence”. 


How much energy is squandered by not sleeping? 
Virtually all modern computers support advanced sleep 
states, S1 - S4 as defined in the ACPI specification [5]. 

Figure 1: Distribution of the split among off, idle and
active periods across users.


These states vary in their characteristics—whether the 
CPU is powered off, how much memory state is lost, 
which buses are clocked and so on. However, common
to all states is that the CPU stops executing instructions
and hence the computer appears to be powered down.
Thus although these sleep states conserve energy, the undesirable
side-effect is that a sleeping computer effectively
“falls off” the network, making it unavailable for
remote access and unable to perform routine tasks that 
may have been scheduled at particular times. This leads 
many users to disable power management altogether and 
instead leave machines running 24/7. For example, stud- 
ies have shown that approximately 60% of the PCs in of- 
fice buildings remain powered on overnight and almost 
all of these have power management disabled [20]. 

To more carefully quantify the amount of wasted en- 
ergy (and hence potential savings), we analyzed the trace 
data collected at our enterprise machines. To determine 
whether a machine has a locally present and active user, 
we examine the recorded mouse and keyboard activity 
for the machine: if no such activity is recorded for 15 
minutes, we say that the machine is idle. We use 15 min- 
utes because it is the default timeout recommended by 
EnergyStar for putting machines to sleep, and because it 
represents a simple (and fairly liberal) approximation for 
the notion of idle-ness, for which a standard definition 
does not exist. We maintain this definition of idle-ness 
for the remainder of the paper. 
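
The idleness rule above can be sketched in a few lines. This is only an illustration, not the tool used in the study: it assumes the per-second activity indicators are available as a sequence of booleans (a hypothetical layout), and the function name is ours.

```python
IDLE_TIMEOUT_S = 15 * 60  # the 15-minute timeout used in the text


def idle_seconds(activity, timeout=IDLE_TIMEOUT_S):
    """Count the seconds during which a machine is considered idle.

    `activity` holds one boolean per second (assumed layout): True when
    keyboard or mouse activity was recorded in that second. A second is
    idle once more than `timeout` seconds have passed without activity.
    """
    idle = 0
    quiet = 0  # consecutive seconds without any user activity
    for active in activity:
        if active:
            quiet = 0
        else:
            quiet += 1
            if quiet > timeout:
                idle += 1
    return idle
```

With this convention the first 15 quiet minutes after the last keystroke are not counted as idle, which matches a sleep timer that only fires once the timeout expires.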

At any point in time, we classify a machine as being in 
one of four possible states: (a) on and actively used, which we
call active; (b) on but not used, idle; (c) in a sleep
state such as S3 or S4; and (d) powered down, off. Note
that this notion of “idle” refers here to the user, and not 
the machine, being inactive. 

In Figure 1 we present this data for our enterprise desk- 
tops. We focus here on the desktops since this represents 
the potential energy savings an enterprise could garner. 
Because the bulk of our traces come from mobile users, 
we have a limited number of desktops. We see that the 
fraction of time when these machines are active is quite 
low, falling below 10% on average. Moreover, the average
fraction of time when machines are idle is high,
about 50%. Similar to other studies, we note that a small
fraction of our desktops (only 5 out of 24) use sleep mode
at all. Overall, this indicates that there is a tremendous 
opportunity for energy savings on enterprise desktops. 
The opportunity on our corporate laptops exists too, but 
is moderate because we found that our laptop users were 
more likely to employ aggressive sleeping configurations 
that come pre-configured on laptops. 

While the sample of the desktop machines in our exper- 
iments is small, the results are consistent with existing 
studies [20]. We therefore use these measured idle times 
to extrapolate the energy that could be saved by sleeping 
instead of remaining idle. There are estimated to be about 
170 million desktop PCs in the US (data summarized in 
[23]). Assuming an 80W power consumption of an idle 
PC, and assuming these machines are idle for 50% of the 
time, this amounts to roughly 60 TWh/year of wasted 
electricity (or 6 billion dollars, at US$0.10 per kWh). 
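
The arithmetic behind this estimate can be checked directly. The short sketch below uses only the figures quoted above (170 million PCs, 80W, 50% idle, US$0.10 per kWh); the function name is illustrative, and the power drawn while asleep is treated as negligible, as in the estimate itself.

```python
HOURS_PER_YEAR = 24 * 365


def wasted_energy_twh(num_pcs, idle_power_w, idle_fraction):
    """Yearly energy (in TWh) drawn by PCs that idle instead of sleeping."""
    watt_hours = num_pcs * idle_power_w * idle_fraction * HOURS_PER_YEAR
    return watt_hours / 1e12  # Wh -> TWh


twh = wasted_energy_twh(170e6, 80, 0.5)   # about 59.6 TWh per year
cost_usd = twh * 1e9 * 0.10               # 1 TWh = 1e9 kWh, at US$0.10 per kWh
```

The result is just under 60 TWh per year, or just under 6 billion dollars, consistent with the rounded figures in the text.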


Is low-power proxying needed? Before developing 
new solutions to reducing host idle times, we investigate 
whether very simple approaches like waking up for ev- 
ery packet can deliver these savings while maintaining 
full network presence. In this approach, which we denote
WoP (wake on packet), the machine is woken up for
every packet it needs to receive (directed or broadcast),
and put back to sleep after the packet is served. The performance
of such an approach depends on whether the
inter-packet gap (IPG) is smaller than or comparable to the
time it takes to transition in and out of sleep. If it is,
then there is no gain over simply leaving the machine in
an idle state.

To examine the traffic during idle times, we used both 
our desktop and laptop machines. We consider both types 
(even though we’re primarily interested in desktops) be- 
cause this gives us a significantly larger set of samples. 
We separate the idle time traffic into two categories, of- 
fice and home. In Figure 2 we plot the average number of 
packets/sec for idle traffic both in the office and at home. 
In the office environment, the average number of packets 
per second is roughly 3, while at home it is roughly 1.
This indicates a fairly constant level of background chat- 
ter on the network, independent of the user’s activity. Be- 
cause this number is an average, we need to understand 
if these packets occur in bursts or not. If the packets are 
bursty most of the time, then there may still be opportu- 
nities to sleep as the host can be woken up to service a 
burst of packets and then be put to sleep for some reason- 
able period of time (certainly more than a few seconds). 
If these packets occur fairly evenly spaced, then it is not
worth going to sleep unless the time to transition in and 
out of sleep is very small (on the order of 1 to 3 seconds). 

To quantify the burstiness level of our traffic, we group 
inter-packet gaps into second-long bins (i.e., 0-1s, 1-2s,
etc.). We then compute the sum of the inter-packet gaps 
in each of these bins, and finally compute the fraction 
of total idle time represented by each bin. We present 
these results in Figure 3, for both home and office envi- 
ronments. In the office, over 90% of the time, the IPG 
is less than 2 seconds. Although the distribution is more 
uniformly spread for the home environment, we still see 
that roughly 70% of the time, the IPG is less than 20 
seconds. Overall we observe that: (a) neither of the en- 
vironments enjoys many long periods of quiet time; (b) 
we find this distribution to be very different for the two 
environments. In home networks the distribution has a 
much heavier tail, the traffic is burstier, and we do see 
longer periods of quiet time. 
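
The binning procedure described above can be sketched as follows. This is a simplified illustration rather than the analysis code used for Figure 3: it assumes a sorted list of packet arrival times (in seconds) from one contiguous idle period, and treats the span between the first and last packet as the idle time.

```python
from collections import defaultdict


def idle_time_by_gap_bin(timestamps):
    """Fraction of total time covered by inter-packet gaps of each length.

    Returns a dict mapping bin index k (gaps of k to k+1 seconds) to the
    share of the total time spent inside gaps that fall in that bin.
    """
    gaps = [later - earlier for earlier, later in zip(timestamps, timestamps[1:])]
    total = sum(gaps)
    if total <= 0:
        return {}
    per_bin = defaultdict(float)
    for gap in gaps:
        per_bin[int(gap)] += gap  # weight each bin by gap duration, not count
    return {k: per_bin[k] / total for k in sorted(per_bin)}
```

Weighting by duration rather than by count matters here: a single 3-second gap contributes as much idle time as six half-second gaps.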

We now translate these observations into actual sleep 
time. In order to perform this computation, we must consider
a representative value for the time interval it takes
the host to wake up, process the packet and then go to
sleep again; we call this the transition time, denoted t_s.
Today, typical machines take 3-8 seconds to enter S3
sleep, and 3-5 seconds to fully resume from S3, as measured
in a recent study [6]. Therefore, it is reasonable to
assume an average transition time t_s of 10s.

When a packet arrives, the machine is woken up to 
serve the packet. After processing a packet, the machine 
only goes to sleep again if it knows the next packet will 
not arrive before it transitions to sleep. This idealized test 
thus assumes that the host knows the future incoming
packet stream and captures the best the machine could 
do in terms of energy savings. 
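
One way to read this idealized policy is sketched below. It is our interpretation, not the authors' code: we assume that a gap between consecutive packets longer than the transition time yields sleep equal to the gap minus the transition time, and that shorter gaps yield none. The function and parameter names are illustrative.

```python
def oracle_sleep_fraction(packet_times, idle_start, idle_end, transition_s=10.0):
    """Fraction of an idle period a host could sleep under wake-on-packet,
    given perfect knowledge of future packet arrivals.

    Assumption: a gap longer than `transition_s` between consecutive
    events allows (gap - transition_s) seconds of sleep; any shorter gap
    keeps the host awake.
    """
    marks = [idle_start] + sorted(packet_times) + [idle_end]
    asleep = 0.0
    for prev, nxt in zip(marks, marks[1:]):
        gap = nxt - prev
        if gap > transition_s:
            asleep += gap - transition_s
    total = idle_end - idle_start
    return asleep / total if total > 0 else 0.0
```

Under this reading, a host that receives a packet every 3 seconds never sleeps with a 10s transition time, which is the situation the office traces approach.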

Figure 4 presents the fraction of idle time for which 
users can sleep, assuming the policy described above. 
The results are dramatically different across
environments. In the office, there is almost no oppor- 
tunity to sleep for the majority of the users. This indi- 
cates that the magic packet-like approach will not suc- 
ceed in saving any energy for machines in a typical cor- 
porate office environment. For the home environment, 
we see that roughly half the users can sleep for over 
50% of their idle times. Thus in these environments, a 
10s transition time coupled with a WoP type policy can 
be somewhat effective. However, these estimates assume 
perfect knowledge of future traffic arrivals and also fre- 
quent transitions in and out of sleep—in practice, we ex- 
pect the achievable savings would be somewhat lower. 
Nonetheless, this does suggest that efforts to reduce sys- 
tem transition times in future hardware could obviate the 
need for more complex power-saving strategies in certain 
environments. 

We conclude that while significant opportunity for 
sleep exists, capitalizing on this opportunity requires so- 
lutions that go beyond merely waking the host to han- 
dle network traffic; we thus consider solutions based on 
proxying idle-time traffic in the following sections. 


4 Deconstructing traffic 


In the previous section we saw that, by just waking up 
to handle all packets, our ability to increase a machine’s 
sleep time is limited. In particular, we see virtually no 
energy savings in the dominant office environments. This 
suggests that we need an approach that is more discriminating
in choosing when to wake hosts. This leads us to
an alternative to WoL, which is to employ a
network proxy whose job is to handle idle-time traffic on
behalf of one or more sleeping hosts. Packets destined
for a sleeping host are intercepted by (or routed to, depending
on the proxy deployment model) its proxy. At
this point, the proxy must know what to do with this intercepted
traffic; broadly, the proxy must choose between
three reactions: a) ignore/drop the packet; b) respond to
the packet on behalf of the machine; or c) wake up the
machine to service it. To make a judicious choice, the
proxy must have some knowledge of network traffic:
what traffic is safely ignorable, which applications
packets belong to, which applications are essential, and
so forth. In this section, we do a top-down deconstruction
of the idle-time traffic traces aimed at learning the
answers to these questions.

Figure 5: Composition of incoming and outgoing traffic
during idle times, for home and office environments,
based on communication paradigms.
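
The three reactions open to the proxy (drop, respond, wake) can be pictured as a small lookup. The sketch below is purely illustrative: the rule table, the names, and the choice of waking the host for unknown traffic are our assumptions, not a design taken from the paper.

```python
from enum import Enum


class Action(Enum):
    DROP = "drop"        # a) ignore the packet
    RESPOND = "respond"  # b) answer on behalf of the sleeping host
    WAKE = "wake"        # c) wake the host to service the packet


def decide(protocol, for_this_host, rules, default=Action.WAKE):
    """Return the proxy's reaction to a packet.

    `rules` maps (protocol, for_this_host) to an Action. Unknown traffic
    falls back to `default`; waking is the conservative choice assumed here.
    """
    return rules.get((protocol, for_this_host), default)


# Hypothetical rule table, for illustration only.
EXAMPLE_RULES = {
    ("ARP", False): Action.DROP,
    ("ARP", True): Action.RESPOND,
    ("IPX", False): Action.DROP,
    ("IPX", True): Action.DROP,
}
```

For instance, `decide("ARP", True, EXAMPLE_RULES)` yields `Action.RESPOND`, while a protocol absent from the table wakes the host.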


4.1 Traffic Classes by Communication Paradigm 


To begin, we look at all packets exchanged during idle 
periods, and classify each packet as either being a broad- 
cast, multicast or unicast packet. Within these broad traf- 
fic classes, we further partition the traffic by whether the 
packets are incoming or outgoing, for both the home and 
office environments. We separate incoming and outgoing 
traffic because we expect them to look different in terms 
of the proportion of each class in different directions 
(e.g., most end-hosts ought to send little broadcast traf- 
fic). Similarly, we look at different usage environments 
because it is intuitive that the dominant protocols and ap- 
plications used in each environment may differ. Since we 
expect these differences, we treat them as such to avoid 
mischaracterizations. The breakdown of our traffic according
to all these partitions is depicted in Fig. 5.
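
The text does not say which header field was used for this classification. As one plausible sketch, the Ethernet destination address suffices: the all-ones address is broadcast, and any other address with the group bit set (the least-significant bit of the first octet) is multicast. The function name is ours.

```python
def traffic_class(dst_mac):
    """Classify a frame by its destination MAC, given as 'aa:bb:cc:dd:ee:ff'."""
    octets = bytes(int(part, 16) for part in dst_mac.split(":"))
    if len(octets) != 6:
        raise ValueError("expected a 6-octet MAC address")
    if octets == b"\xff" * 6:
        return "broadcast"
    if octets[0] & 0x01:  # group bit set: multicast
        return "multicast"
    return "unicast"
```

The broadcast test must come first, because the broadcast address also has the group bit set.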

We note that outgoing traffic is dominated by unicast 
traffic since, as expected, each host generates little broad- 
cast or multicast traffic. We also find that incoming traffic 
at a host sees significant proportions of all three classes
of traffic, and this is true in both enterprise and home 
environments. This suggests that a power-saving proxy 
might have to tackle all three traffic classes to see signif- 
icant savings. 

So far, we looked at traffic volumes as indicative of the
need to proxy the corresponding traffic type. We now directly
evaluate the opportunity for sleep represented by
each traffic type. To understand the maximum sleeping
opportunities, we consider for a moment an idealized
scenario in which we use our proxy to ignore all incoming
packets from either or both of the broadcast and multicast
traffic classes. A machine always wakes up for unicast
packets. Fig. 6 shows the sleep potential in four scenarios:
a) ignore only broadcast and wake for the rest;
b) ignore only multicast and wake for the rest; c) ignore
both broadcast and multicast; and, for comparison purposes,
d) wake up for all packets. This allows us
to compare the benefits derived from these four different
proxy policies. For each user, we computed the fraction
of its idle time that could have been spent sleeping under
the scenario in question. We use a transition time of
t_s = 10s and the results are averaged over 250 users for
both home and office environments.

We make the following observations: 

(i) Broadcast and multicast are largely responsible for 
poor sleep. If we can proxy these, then we can recuper- 
ate over 80% of the idle time in home environments. And 
in the office, where previously sleep was barely possible, 
we can now sleep for over 50% of the idle time. 

(ii) Doing away with only one of either broadcast or
multicast is not very effective (we suspect this is due to 
the periodicity of multicast and broadcast protocols, and 
evaluate this in later sections). 

More generally, the graph clearly indicates a valuable 
conclusion—if we’re looking to narrow the set of traf- 
fic classes to proxy, then multicast and broadcast traf- 
fic appear to be clear low-hanging fruit and should be 
our primary candidates for proxying. That said, proxying 
unicast traffic appears key to achieving higher savings 
(beyond 50%) in the enterprise and hence should not be 
dismissed either. We thus continue, for now, to study all 
three traffic types. 

Of course, whether these potential savings can actually 
be realized depends on whether a particular traffic type 
can indeed be handled by a proxy without waking the 
host. This depends on the specific protocols and applica- 
tions within that class and hence, in the remainder of this 
section, we proceed in turn to deconstruct each of broadcast
(§4.2), multicast (§4.3) and unicast (§4.4) traffic.


4.2 Deconstructing Broadcast 


Our goal in this section is to evaluate individual broad- 
cast protocols, looking for: (1) which of these protocols 
are the main offenders in terms of preventing hosts from 
sleeping and, (2) what purpose do these protocols serve 
and how might a proxy handle them. Answering the first 
question requires a measure of protocol “badness” with 
respect to preventing hosts from sleeping. We use two 
metrics for our evaluation. The first is simply the total
volume of traffic due to the protocol in question. While
high-volume traffic often makes sleep harder, this is an
imperfect metric since the (in)ability to sleep depends as
much on the precise temporal packet arrival pattern due
to the protocol as on packet volumes. Nonetheless, we retain
traffic volume as an intuitive, although indirect, measure
of protocol badness. Our second metric, which we
term the half-sleep time, denoted ts_50, more directly
measures a protocol’s role in preventing sleep.

Figure 6: Average sleep opportunity when ignoring multicast
and/or broadcast traffic, for different environments.

We define the half-sleep time for a protocol (or traffic 
type) P as the largest host transition time that would be 
required to allow the host to sleep for at least 50% of its 
idle time, under the scenario where the machine wakes 
up for all packets of type P and ignores all other traffic. 
In effect, ts_50 quantifies the intuition that, if we ignore
all traffic other than that due to the protocol of interest, 
then a protocol whose packets arrive spaced far enough 
apart in time is more conducive to sleep since the host 
has sufficient time to transition in and out of sleep. 

In more detail, ts_50 is computed from our traces as
follows. We measure the total time a given host can sleep
assuming it wakes up for all the packets of the protocol
under consideration and ignores all others. We compute
this number for all hosts and take the average. This gives
us an upper bound on achievable sleep if the protocol
is handled by waking the host. We estimate this sleep
duration for different values of the host transition time t_s,
ranging from 0 seconds (ideal) to 15 minutes. The largest
of these transition times t_s that allows the host to sleep
for over 50% of its idle time is the protocol’s ts_50.
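
The computation just described can be sketched as a sweep over candidate transition times. This is a minimal illustration under stated assumptions, not the authors' implementation: the per-host sleep model (a gap longer than the transition time yields the excess as sleep), the input layout, and all names are ours, and we use the "at least 50%" threshold from the definition.

```python
def sleep_fraction(packet_times, idle_start, idle_end, transition_s):
    """Fraction of the idle period a host can sleep if it wakes for every
    packet in `packet_times` (assumed model: a gap longer than the
    transition time yields gap - transition_s seconds of sleep)."""
    marks = [idle_start] + sorted(packet_times) + [idle_end]
    asleep = sum(
        (nxt - prev) - transition_s
        for prev, nxt in zip(marks, marks[1:])
        if (nxt - prev) > transition_s
    )
    total = idle_end - idle_start
    return asleep / total if total > 0 else 0.0


def half_sleep_time(hosts, candidates):
    """Largest candidate transition time for which the sleep fraction,
    averaged over hosts, is at least 50%; None if no candidate qualifies.

    `hosts` is a list of (packet_times, idle_start, idle_end) tuples holding
    only the packets of the protocol under study (hypothetical layout).
    """
    best = None
    for t in sorted(candidates):
        mean = sum(sleep_fraction(p, s, e, t) for p, s, e in hosts) / len(hosts)
        if mean >= 0.5:
            best = t
    return best
```

As an example, a protocol that sends one packet every 20 seconds gives a half-sleep time of 10 seconds under this model, since each 20-second gap then leaves exactly 10 seconds of sleep.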

Intuitively, ts_50 indicates the extent to which a pro- 
tocol is “sleep friendly” since protocols with large val- 
ues of ts_50 could simply be handled by allowing the 
machine to wake up; whereas those with low values of 
ts_50 imply that (to achieve useful sleep) the proxy 
must handle such traffic without waking the host. 

For our evaluation, we classify each packet by protocol
and rank protocols by both metrics: traffic volume and the
half-sleep time. We begin by measuring traffic volume;
we then establish the top-ranking protocols by volume,
and use these as candidates for our second metric, the
half-sleep time. When presenting the top-ranking protocols
by each of the metrics, we consider: (1) the protocols
whose traffic volume represents more than 1% of
the total traffic at the host and (2) the protocols with a
half-sleep time of less than 15 minutes. Figures 7 and 8
present our results for broadcast traffic. For completeness,
we also present the value of ts_50 when considering
all broadcast traffic together.

Figure 7: Protocol composition of incoming broadcast
traffic, in both office and home environments, ranked by
per-protocol traffic volumes.

Figure 8: Protocol composition for broadcast protocols
ranked by ts_50.

In terms of traffic volumes, we see that the bulk of
broadcast traffic is due to address resolution and
various service discovery protocols (e.g., ARP; the NetBIOS
Name Service, NBNS; and the Simple Service Discovery
Protocol used by UPnP devices, SSDP). These protocols
are well represented in both home and office LANs.
A second well-represented category of traffic is from
router-specific protocols (e.g., routing protocols implemented
on top of IPX).

In terms of the half-sleep time, we see that broadcast
as a whole allows very little sleep in the office: achieving
50% sleep would require very fast transitions (between
1 and 2 seconds), not feasible with today’s hardware
support. The situation in home LANs is significantly
better (ts_50 = 10s). In terms of protocols, we
see that the greatest offenders are similar to those from
our traffic-volume analysis, namely: ARP, NetBIOS Datagrams
(NBDGM) and Name Queries (NBNS), and IPX.

On closer examination, we find that most of these of- 
fending protocols could be easily handled by a proxy: 
for example, IPX is safely ignorable, ARP traffic that is 
not destined to the machine in question is likewise safely 
ignorable; for ARP queries destined to the machine, it 
would be fairly straightforward for a proxy to automati- 
cally construct and generate the requisite response with- 
out having to wake the host. 
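
To illustrate how little state such a mechanical ARP response needs, the sketch below builds the reply from the request plus the sleeping host's IP and MAC addresses. It handles only the 28-byte ARP body for Ethernet/IPv4 (Ethernet framing is omitted), and the function name and interface are hypothetical rather than part of any proxy described here.

```python
import socket
import struct

# ARP body for Ethernet/IPv4: htype, ptype, hlen, plen, oper, sha, spa, tha, tpa
ARP_FMT = "!HHBBH6s4s6s4s"
ARP_REQUEST, ARP_REPLY = 1, 2


def proxy_arp(payload, host_ip, host_mac):
    """Return an ARP reply body on behalf of a sleeping host, or None when
    the packet can be ignored.

    `payload` is the ARP body, `host_ip` a dotted-quad string and
    `host_mac` the host's 6-byte hardware address.
    """
    size = struct.calcsize(ARP_FMT)  # 28 bytes
    if len(payload) < size:
        return None
    htype, ptype, hlen, plen, oper, sha, spa, tha, tpa = struct.unpack(
        ARP_FMT, payload[:size]
    )
    if (htype, ptype, hlen, plen) != (1, 0x0800, 6, 4):
        return None  # not Ethernet/IPv4 ARP
    if oper != ARP_REQUEST or tpa != socket.inet_aton(host_ip):
        return None  # not a query for the sleeping host: ignorable
    # Swap roles: the host becomes the sender, the requester the target.
    return struct.pack(ARP_FMT, 1, 0x0800, 6, 4, ARP_REPLY, host_mac, tpa, sha, spa)
```

The only per-host state consulted is the IP and MAC address pair, which is why ARP is a natural candidate for handling without waking the host.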

Figure 9: Protocol composition for incoming multicast
traffic, in both office and home environments, ranked by
per-protocol traffic volumes.

Figure 10: Protocol composition for incoming multicast
traffic, in both office and home environments, ranked by
ts_50.


4.3 Deconstructing Multicast


Figures 9 and 10 present our protocol rankings for
multicast traffic. Again, we also present the value of
ts_50 when considering all multicast traffic taken together.
We see that multicast traffic (as a whole) can
be a bad offender in enterprise environments, with a
ts_50 of 0-1s. It turns out that this is largely caused by
router traffic: the Hot Standby Router Protocol (HSRP),
Protocol Independent Multicast (PIM), EIGRP, etc.

This traffic is either absent (e.g., PIM) or greatly reduced
(e.g., HSRP) in home environments, which explains
why multicast is much less problematic in homes,
with a ts_50 of 1-2 minutes (compared to 10-20s
for broadcast).

The good news is that all router traffic (HSRP, PIM,
EIGRP) is safely ignorable. In fact, many modern Ethernet
cards already include a hardware multicast filter that
discards most unwanted multicast traffic.

As with broadcast traffic, we also see significant traffic 
contributed by service discovery protocols: in this case 
SSDP, the Simple Service Discovery Protocol used by 
UPnP devices. Once again, for protocols such as SSDP 
and IGMP, it is fairly straightforward for a proxy to auto- 
matically respond to incoming traffic without waking the 
host; doing so would require some amount of state at the 
proxy such as the list of multicast groups the interface 
belongs to and the services running on the machine. 


4.4 Deconstructing Unicast 


Finally, we present our protocol ranking for unicast traffic
in Figures 11 and 12. Because much of unicast traffic
is either TCP or UDP, and this level of classification
is unlikely to be informative, we further break each
down by session-layer protocol with an additional mapping
from ports in Figure 13. Unfortunately, unlike the
case of broadcast and multicast, with unicast, it is harder
to deduce the ultimate purpose for much of this traffic
since even the session or application-level protocol identifiers
are fairly generic. (One exception is the “BigFix”
application listed in Fig. 13. BigFix is an enterprise software
patching service that checks security compliance of
enterprise machines; based on the frequency and volume
of BigFix traffic we see, it appears to have been configured
by an over-zealous system administrator.)

Figure 11: Protocol composition of incoming unicast
traffic in office environments, ranked by per-protocol traffic
volumes.

Figure 12: Protocol composition of incoming unicast
traffic in office environments, ranked by ts_50.

Figure 13: Protocol composition for unicast traffic based
on TCP and UDP ports, ranked by ts_50.

Stymied in our attempts to deconstruct unicast traffic
based on whether and how it might be proxied, we try
an alternate strategy. We classify TCP and UDP packets
based on the connections they belong to and categorize
connections as incoming vs. outgoing. We are interested
in this classification because we suspect that a large
portion of packets are likely to belong to outgoing connections.
And while a host might wake for incoming connections,
waking for outgoing connections might well be
avoidable (for reasons discussed below). From the results
in Fig. 14, we see that outgoing connections do indeed
dominate. Now for a sleeping machine, there are three
possibilities for these outgoing connections: (1) the connection
was initiated by the host before the idle period;
in this case, such traffic might not be ignorable if the
host/proxy wants to maintain this connection, hence we
hope this percentage of traffic is small; (2) the connection
was initiated but failed; (3) the connection was initiated
by the host after the start of the idle period; for
a sleeping host, these connections would either simply
never have been initiated (if the connection were deemed
unnecessary) or the host would be deliberately woken to
initiate these connections (if the connection were deemed
necessary, as for services scheduled to run during idle
times). For the former, the traffic can simply be ignored
from our accounting and, in the latter case, such scheduled
processing is easily batched and hence needn’t disrupt
sleep. Hence for all but the first case, waking the
machine might be avoidable. We plot this breakdown of
outgoing connections in Figure 15. We see that only a
relatively small percentage of outgoing connections (always
less than 25%) belong to the first category, which
might require waking the host. Based on this, we speculate
that it might be possible to eliminate or ignore much
of even unicast traffic.

Figure 14: Fraction of packets generated by incoming vs.
outgoing connections. For home and office, both received
and transmitted packets.
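
The three-way split of outgoing connections can be expressed as a simple classifier. The sketch below is illustrative only: the record layout (start time plus a flag for whether the connection was established) is assumed, and we choose to test for failure first so that a failed attempt is never counted among the connections that may need to be maintained.

```python
from collections import Counter


def classify_outgoing(start_time, idle_start, established):
    """Place an outgoing connection into one of the three categories."""
    if not established:
        return "failed"       # case (2): initiated but failed
    if start_time < idle_start:
        return "pre-idle"     # case (1): may have to be maintained
    return "during-idle"      # case (3): avoidable or batchable


def breakdown(connections, idle_start):
    """Fraction of connections per category.

    `connections` is a non-empty list of (start_time, established) pairs
    (hypothetical layout).
    """
    counts = Counter(
        classify_outgoing(start, idle_start, ok) for start, ok in connections
    )
    total = sum(counts.values())
    return {category: n / total for category, n in counts.items()}
```

Only the "pre-idle" share corresponds to traffic that might force the host awake; the other two categories are the ones argued above to be avoidable.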


Early in this section, we asked whether one might identify
a small set of protocols or proxy behaviors that
could yield significant savings. We find that the answer
is positive in the case of multicast and broadcast but less
clear for unicast traffic. In the next section we consider 
the implications of our traffic analysis for proxy design. 
failed connection attempts.


5 A Measurement-driven Approach to Proxy Design


Having studied the nature of idle-time traffic, we now ap- 
ply our findings to the design of a practical power-saving 
proxy. We start in Section 5.1 by extracting the high-level 
design implications of our traffic analysis from the previ- 
ous section. Building on this, in Section 5.2, we illustrate 
the space of design tradeoffs by considering four specific 
examples of proxies. In Section 5.3, we distill our find- 
ings into a proposal for a core proxy architecture that of- 
fers a single framework capable of supporting the broad 
design space we identify. 


5.1 Design Implications 


At minimum, a power-saving proxy should: (a) allow the 
host to sleep for a significant fraction of the time, and 
(b) maintain the basic network presence of the host by 
ensuring remote entities can still address and reach the 
machine and the services it supports. Beyond this, we 
have a significant margin of freedom in choosing how a 
proxy might handle the remaining idle-time traffic and 
applications. Viewed through this lens, our results from 
Section 4 lead us to differentiate idle-time traffic along 
two different dimensions. The first classifies traffic based 
on the need to proxy the traffic in question: 

(1) don’t-wake protocols: these are protocols that gen- 
erate sustained and periodic traffic and hence, ideally, 
would be dealt with (by a proxy) without waking the host 
since otherwise the host would enjoy little sleep. Exam- 
ples of such protocols identified in the previous section 
include IGMP, PIM, ARP. Table 1 lists a set of protocols
we classify as don’t-wake.

(2) don’t-ignore protocols: these are protocols that require
attention to ensure the correct operation of higher-layer
protocols and applications. For example, we must
ensure that the DHCP lease on an IP address is maintained
and that a machine responds to NetBIOS
name queries so that the services it runs over NetBIOS
remain addressable. The protocols we identified as don’t-ignore
are listed in Table 1. Note that the lists of don’t-wake
and don’t-ignore protocols need not be mutually
exclusive; for example, ARP traffic is both frequent and
critical and hence falls under both categories.


Don’t wake      ICMP, IGMP, HSRP, ARP, PIM, NBDGM, SSDP
Don’t ignore    ARP (for me), NBNS, DHCP (for me)

Table 1: Protocols that shouldn’t cause a wake up (too expensive
in terms of sleep), and protocols that should not be ignored
(for correctness).


Ignorable             HSRP, PIM, ARP (for others), IPX, LLC, EIGRP, DHCP

Mechanical response   Protocol   State
                      ARP        IP address
                      NBNS       NB names of machine and local services
                      SSDP       Names of local plug-n-play services
                      IGMP       Multicast groups the interface belongs to
                      ICMP       IP address
                      NBDGM      NB names of machine and local services.
                                 Ignores pkts. not destined to host, wakes
                                 host for rest

Table 2: Protocols that can be handled by ignoring or by mechanical
response. We classify DHCP as ignorable because we
choose to schedule the machine to wake up and issue DHCP
requests to renew the IP lease, an infrequent event.

(3) policy-dependent traffic: for the remainder of traf- 
fic, the choice of whether and how a proxy should handle 
the traffic is a matter of the tradeoff the user (or soft- 
ware designer) is seeking to achieve between the sophis- 
tication of idle-time functionality, the complexity of the 
proxy implementation and energy savings. We shall ex- 
plore these tradeoffs in the context of concrete proxy im- 
plementations in Section 5.2. 

A complementary dimension along which we can clas- 
sify traffic is based on the complexity required to proxy 
the traffic in question: 

(A) ignorable (drop): this is traffic that can safely be 
ignored. Section 4 identified several such protocols and 
the top-ranked of these are listed in Table 2. Comparing 
Tables 1 and 2, we see that (fortunately) there is a 
significant overlap between don’t-wake and ignorable 
protocols. Policy-dependent traffic/applications that are 
deemed unimportant to maintain during idle times could 
likewise be ignored, while don’t-ignore protocols 
obviously cannot be. 

(B) handled via mechanical responses: this includes 
incoming (outgoing) protocol traffic for which it is easy to 
construct the required response (request) using little-to-no 
state transferred from the sleeping host. (The distinction is 
somewhat subjective.) For example, a proxy can easily 
respond to NetBIOS Name Queries asking about local 
NetBIOS services, once these services are known by the 
proxy. Table 2 lists key protocols that can be dealt with 
through mechanical responses. 

(C) require specialized processing: this covers proto- 
col traffic that, if proxied, would require more complex 
state maintenance (transfer, creation, processing and up- 
date) between the proxy and host. For example, consider 
a proxy that takes on the role of completing ongoing p2p 
downloads on behalf of a sleeping host — this requires 
that the proxy learn the status of ongoing and sched- 
uled downloads, the addresses of peers, etc. and more- 
over that the proxy appropriately update/transfer state at 
the host once it resumes. In theory, specialized process- 
ing would be attractive for policy-dependent traf- 
fic that is both important and frequently-occurring (since 
otherwise we could simply drop unimportant traffic and 
wake the host to process infrequent traffic). 

Of course, in addition to the above (classes A-C), 
for traffic that a proxy doesn’t ignore but doesn’t 
want to (or know how to) handle, a proxy always has the 
option of waking the host. Essentially, the decision of whether to 
handle desired traffic in the proxy versus waking the host 
represents a tradeoff between the complexity of a proxy 
implementation and the sleep time of hosts. 


5.2 Example Proxies 


We now present four concrete proxy designs derived 
from the distinctions drawn above. We select these prox- 
ies to be illustrative of the design tradeoffs possible but 
also representative of practical and useful proxy designs. 


proxy_1 We start with a very simple proxy that: (1) 
ignores all traffic listed as ignorable in Table 2 and (2) 
wakes the machine to handle all other incoming traffic. 
Besides clearly ignorable protocols, we choose to also 
ignore traffic generated by the Bigfix application (TCP 
port 63422), which we previously identified (Section 4) 
to be one of the big offenders. We do so because a) this 
traffic is not representative for non-Intel machines, and b) 
the application is very badly configured: it sends very 
large amounts of traffic for little offered functionality, 
making sleep almost impossible. 

This proxy is simple — it requires no mechanical or spe- 
cialized processing. At the same time, because it makes 
the conservative choice of waking the host for all traf- 
fic not known to be safely ignorable, this proxy is fully 
transparent to users and applications, in the sense that 
the effective behavior of the sleeping machine never 
differs from what it would have been had the machine stayed 
idle (except for the performance penalties due to the 
additional wake-up time). 


proxy_2 Our second proxy is also fully transparent, but 
takes on greater complexity in order to reduce the 
frequency with which the machine must be woken. This 
proxy: (1) ignores all traffic listed as ignorable in Table 2, 
(2) issues responses for protocol traffic listed in the 
same table as to be handled with mechanical responses, 
and (3) wakes the machine for all other incoming traffic. 
Since this proxy needs more state to generate mechanical 
responses (e.g., the NetBIOS names of local services, 
needed to answer NBNS queries), it can also use this 
extra information to selectively ignore more packets than 
proxy_1 (e.g., ignore all NetBIOS datagrams not 
destined for local services). 


proxy_3 Our third proxy generates even deeper savings 
by keeping only a small set of applications (chosen 
by the user) operable during idle times, while ignoring all 
other traffic. We use telnet, ssh, VNC, SMB file-sharing 
and NetBIOS as our applications of interest. This proxy 
performs the same actions (1) and (2) as implemented by 
proxy_2 (it ignores and responds to the same set of 
protocols), but it (3) wakes up for all traffic belonging to any 
of telnet, ssh, VNC, SMB file-sharing and NetBIOS and 
(4) drops any other incoming traffic. Relative to our 
previous example, proxy_3 is less transparent in that the 
machine appears not to be sleeping for some select 
remote applications, but is inaccessible to all others. 


proxy_4 All the above proxies implement functionality 
related to handling incoming packets. In our final proxy, 
we also consider waking up for scheduled tasks initiated 
locally. This proxy behaves identically to proxy_3 with 
respect to incoming packets, but supports an additional 
action: (5) wake up for the following tasks (for which 
we assume that the system is configured to wake up in 
order to perform them): regular network backups, 
anti-virus (McAfee) software updates, FTP traffic for 
automatic software updates, and Intel-specific updates. 
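
The incoming-packet behavior of the four proxies can be summarized as a single decision function. The following is a minimal sketch, not the authors' implementation: the traffic labels (such as "ARP-me" and "ARP-other") and the membership of the MECHANICAL set are hypothetical stand-ins for the protocol classes of Table 2, and the scheduled wake-ups that distinguish proxy_4 are deliberately left out because they are not triggered by incoming packets.

```python
from enum import Enum


class Action(Enum):
    DROP = "drop"
    RESPOND = "respond"
    WAKE = "wake"


# Illustrative traffic classes; the labels are hypothetical.
IGNORABLE = {"HSRP", "PIM", "ARP-other", "IPX", "LLC", "EIGRP", "DHCP"}
MECHANICAL = {"ARP-me", "NBNS-me"}
USER_APPS = {"telnet", "ssh", "VNC", "SMB", "NetBIOS"}


def decide(proxy: int, traffic: str) -> Action:
    """Action taken by proxy_<proxy> (1..4) for one incoming packet class."""
    if traffic in IGNORABLE:
        return Action.DROP          # action (1), shared by all proxies
    if proxy >= 2 and traffic in MECHANICAL:
        return Action.RESPOND       # action (2), proxy_2 and above
    if proxy >= 3 and traffic not in USER_APPS:
        return Action.DROP          # action (4), non-transparent proxies
    return Action.WAKE              # action (3)


if __name__ == "__main__":
    for p in (1, 2, 3):
        print(p, decide(p, "ARP-me").value, decide(p, "HTTP").value)
```

In this sketch proxy_4 returns the same result as proxy_3 for every packet class, which matches the statement that the two behave identically with respect to incoming packets.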


Evaluating tradeoffs In the following we compare the 
sleep achievable by our 4 proposed proxies, and com- 
pare it with the baseline WoP case. We perform this eval- 
uation for both office and home environments, and in 
each case we evaluate 3 possible values for transition 
times ts: 5, 10, and 60 seconds. The first of these (5s) 
is a very optimistic transition time, not achievable today 
using S3 sleep states, but foreseeable in the near future 
(today, Microsoft Vista specifications require computers 
to resume from S3 sleep in under 2s [18]). The second 
(10s) is representative of the shortest transitions achiev- 
able today [6], and the last (1min) is representative of a 
setting that allows almost a minute for processing sub- 
sequent relevant network packets before going to sleep 
again. The advantage of using a very short timer before 
going to sleep is the increased achievable sleep. The 
disadvantage is that the delay penalty for waking the host 
will be incurred by more packets. In the extreme case of 
very short sleep timers, this could make remote 
applications sluggish and unresponsive. For the wake events 
generated by scheduled tasks, we use a longer transition 
time (and thus a longer sleep timer value) of 1min, since 
such tasks usually take longer to complete. 
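
To make the role of the transition time concrete, the following is a minimal sketch of how a sleep fraction could be estimated from a list of wake events. It assumes a simplified model that is ours, not necessarily the one used in the evaluation: the host starts asleep, each wake event keeps it awake for t_s seconds, and overlapping awake windows are merged. The function name and the sample timestamps are hypothetical.

```python
def sleep_fraction(wake_events, t_s, duration):
    """Fraction of [0, duration] spent asleep.

    wake_events: timestamps (seconds) of packets or tasks that wake the host.
    t_s: time the host stays awake after each wake event (seconds).
    """
    awake = 0.0
    awake_until = 0.0  # end of the awake window currently in effect
    for t in sorted(wake_events):
        start = max(t, awake_until)      # do not count overlapping time twice
        end = min(t + t_s, duration)     # clip at the end of the trace
        if end > start:
            awake += end - start
        awake_until = max(awake_until, t + t_s)
    return 1.0 - awake / duration


if __name__ == "__main__":
    events = [10, 12, 100]
    for t_s in (5, 10, 60):
        print(t_s, round(sleep_fraction(events, t_s, 200), 3))
```

With the same three events, a larger t_s yields a smaller sleep fraction, which is the effect the three transition times are meant to expose.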
Figure 16: Savings achieved by different proxies in home 
and office environments. 


Examining the performance of our proxies, we make 
the following high-level observations: a) At one end of 
the spectrum, proxy_1 (the simplest) is inadequate in 
office environments, and borderline adequate in home 
environments. b) At the other end of the spectrum we 
have proxy_3, which only handles a select number of 
applications, but in return achieves good sleep in all 
scenarios: more than 70% of idle time even in the office 
and with a transition time of 1 minute. c) The efficiency 
of proxy_2 depends heavily on the environment. While the 
additional complexity (compared to proxy_1) makes it 
a good fit in home environments (sleeping close to 60% 
even for ts = 1min), having to handle all traffic makes 
it a worse fit for the office (sleeping ~12% for the same 
transition time). This shows that, unless they support a 
large number of rules, transparent proxies are a better fit 
for home, but not the office. d) The best tradeoff between 
functionality and savings, and therefore the appropriate 
proxy configuration, depends on the operating 
environment. e) Since scheduled wake-ups are typically 
infrequent, the impact they have on sleep is minimal: in our 
case, proxy_4 sleeps almost as much as proxy_3 in all 
considered scenarios. 


5.3 A Strawman Proxy Architecture 


Our study leads us to propose a simple proxy architecture 
that offers a unified framework within which we can ac- 
commodate the multiplicity of design options identified 
above. The proposal we present is a high-level one since 
our intent here is merely to provide an initial sketch of 
an architecture that could serve as the starting point for 
future discussion on standardization efforts. 

The core of our proposal is a table—the power-proxy 
table (PPT)—that stores a list of rules. Each rule de- 
scribes the manner in which a specified traffic type 
should be handled by the proxy when idle. A rule 
consists of a trigger, an action and a timeout. 

Triggers are either timer events or regular expressions 
describing some network traffic of interest. When a trig- 
ger’s timer event fires or if an incoming packet matches a 
trigger’s regular expression, the proxy executes the cor- 
responding action. If the action involves waking the host, 
the timeout value specifies the minimum period of time 
for which the host must stay awake before contemplating 
sleep again. To resolve multiple matching rules, standard 
techniques such as ordering the rules by specificity, pol- 
icy, etc. can be used. The proxy table must also include a 
default rule that determines the treatment of packets that 
do not match on any of the explicitly enumerated rules. 
We propose the following actions: 

• drop: the incoming packet is dropped. 

• wake: the proxy wakes the host and forwards the packets 
to it. Other packets buffered while waiting for the 
wake will be forwarded as well. 

• respond(template, state): the proxy uses the 
specified template to craft a response based on the 
incoming packet and some state stored by the proxy. This 
action is used to generate mechanical responses as 
described below. 

• redirect(handle): the proxy forwards the packet to 
a destination specified by the handle parameter. This 
is used to accommodate specialized processing as 
described below. 

A response template is a function that computes the 
mechanical response based on the incoming packet and 
one or more immutable pieces of state. This means that 
our function does not maintain or change any state. There 
is no state carried over between successive incoming 
packets (such as sequence numbers), and no state trans- 
fer between the proxy and the host upon wake-up. We 
choose to support this functionality because a) it is rel- 
atively simple to implement in practice and b) it covers 
most of the non-application specific traffic, as shown in 
Section 4, and illustrated in our proxy examples. 

To accommodate more specialized processing, we as- 
sume developers will write application-specific stubs and 
then enter a redirect rule into the proxy’s PPT, where 
the handle specifies the location to which the proxy 
should send the packet. Such stubs can run on a machine 
accessible over the network (e.g., a server dedicated to 
proxying for many sleeping machines in a corporate 
LAN), or on a low-power micro-engine supported on 
the local host (e.g., a controller on the motherboard, or 
a USB-connected gumstick). In all these cases, the handle 
would be specified by its address, for example an (IP 
address, port) combination. The redirect abstraction thus 
allows us to accommodate specialized processing with- 
out embedding application-specific knowledge into the 
core proxy architecture. 
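
As an illustration of the power-proxy table, the following is a minimal sketch under several assumptions of ours: packets are represented as byte strings, rules are matched in insertion order (one of several possible ways to resolve multiple matches), timer triggers are omitted, and the class and field names (Rule, PowerProxyTable, handle_packet) are hypothetical. The response template is a pure function of the packet and of immutable state, as described above.

```python
import re
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class Rule:
    pattern: re.Pattern                 # trigger: regex over the raw packet
    action: str                         # "drop", "wake", "respond", "redirect"
    timeout: float = 0.0                # minimum awake time after a wake
    template: Optional[Callable[[bytes, dict], bytes]] = None
    handle: Optional[Tuple[str, int]] = None   # (IP address, port) of a stub


class PowerProxyTable:
    def __init__(self, default: Rule, state: dict):
        self.rules = []
        self.default = default          # applies when no rule matches
        self.state = state              # immutable host state used by templates

    def insert(self, rule: Rule) -> None:
        self.rules.append(rule)

    def delete(self, rule: Rule) -> None:
        self.rules.remove(rule)

    def handle_packet(self, pkt: bytes):
        """Return (action, argument) for one incoming packet."""
        rule = next((r for r in self.rules if r.pattern.search(pkt)),
                    self.default)
        if rule.action == "respond":
            return ("respond", rule.template(pkt, self.state))
        if rule.action == "redirect":
            return ("redirect", rule.handle)
        if rule.action == "wake":
            return ("wake", rule.timeout)
        return ("drop", None)


if __name__ == "__main__":
    ppt = PowerProxyTable(default=Rule(re.compile(b""), "drop"),
                          state={"ip": "10.0.0.5"})
    ppt.insert(Rule(re.compile(rb"^TCP SYN"), "wake", timeout=60.0))
    print(ppt.handle_packet(b"TCP SYN port 22"))
    print(ppt.handle_packet(b"HSRP hello"))
```

Because the template receives only the packet and the stored state, no per-flow state needs to be carried between packets or handed back to the host on wake-up.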

The external API to this proxy is twofold: (1) APIs to 
activate/deactivate the proxy as the host enters/exits sleep 
and (2) APIs to insert and delete rules. The process by 
which to install and execute stubs is outside of the core 
proxy specification, which only provides the mechanism 
to register and invoke such stubs. The architecture is 
agnostic to where the proxy runs, allowing implementations 
in hardware (e.g., at host NICs), in PC software (e.g., a 
proxy server running on the same LAN) or in network 
equipment (e.g., a firewall, NAT box). 

Figure 17: Example Click implementation. 

Finally, the use of timer events to wake a host already 
exists today. Our contribution here is merely to integrate 
the mechanism into a unified proxy architecture. 


5.4 Proxy Prototype Implementation 


To illustrate the feasibility of our architecture, we build 
a simple proxy prototype using the Click modular 
router [17]. We choose to deploy the proxying function- 
ality in a standalone machine responsible for maintaining 
the network presence of several hosts on the same LAN. 
To allow our proxy (let us call it P) to sniff the traffic for 
each host, we ensure that P shares the same broadcast 
domain with these hosts. This can be achieved either by 
connecting the proxy and the machines to a common 
network hub, or by configuring the LAN switch to forward 
all traffic to the port that serves P. 

In our initial design, we don’t implement proxies that 
involve transferring state between the host and the proxy. 
Instead, P learns the pieces of state required (e.g., the IP 
address and the NetBIOS name for each host) by sniffing 
host traffic and extracting the state exchanged (e.g., 
ARP and NBNS exchanges). This design circumvents the 
need for any end-host modifications, and supports 
proxying for machines with different hardware platforms (new 
and old) and operating systems. The proxy requires 
minimal configuration (a list of the MAC addresses of the 
hosts that need to be proxied), and can be incrementally 
deployed as a low-power stand-alone network box. 
Once low-power proxying standards are developed [12], 
the design can be extended to support state transfer, and 
achieve even deeper energy savings. 

Our prototype implements very basic proxying func- 
tionality, but the software architecture (presented in Fig- 
ure 17) can be easily extended to more protocols and 
use cases. Currently, we support three types of actions: 
wake, respond and drop. The proxy wakes its hosts for 
TCP connection requests (incoming TCP SYN packets) 
and incoming NetBIOS Name Queries for the host’s NB 
name. If such a “wake packet” for a sleeping host arrives, 
P buffers the request, sends a magic packet to wake the 
host, and relays the buffered packet once the host be- 
comes available. The proxy responds automatically to 
incoming ARP requests, and drops all other incoming 
packets. In relation to the examples discussed in Sec- 
tion 5.2, this prototype has a simple and non-transparent 
design. To determine whether a host is awake, the proxy 
sends periodic ARP queries to each host; if these queries 
receive no response, the host is assumed to be asleep. 
When the proxy attempts to wake a host and fails repeat- 
edly, the host is assumed to be off, rather than just asleep, 
and the proxy ceases to maintain its network presence. 
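
The magic packet mentioned above has a simple, well-known layout: six 0xFF bytes followed by sixteen repetitions of the target's 6-byte MAC address. The following is a minimal sketch of building and sending one. The function names are hypothetical, and sending over UDP broadcast to port 9 is a common convention that we assume here; the prototype's actual sending path is not described in the text.

```python
import socket


def magic_packet(mac: str) -> bytes:
    """Build a Wake-on-LAN magic packet for a MAC like '00:11:22:33:44:55'."""
    hw = bytes.fromhex(mac.replace(":", "").replace("-", ""))
    if len(hw) != 6:
        raise ValueError("expected a 48-bit MAC address")
    return b"\xff" * 6 + hw * 16


def send_magic_packet(mac: str, broadcast: str = "255.255.255.255",
                      port: int = 9) -> None:
    """Send the magic packet as a UDP broadcast (assumed transport)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
        s.sendto(magic_packet(mac), (broadcast, port))


if __name__ == "__main__":
    pkt = magic_packet("00:11:22:33:44:55")
    print(len(pkt), pkt[:12].hex())
```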

Figure 17 presents the software architecture of our 
Click proxy, and highlights the mapping between Click 
modules and the generic categories of triggers, actions 
and state, discussed in the strawman proxy architecture. 

We test our Click-based proxy implementation by in- 
stalling it on one of our enterprise desktops, and con- 
figuring the proxy to maintain the network presence of 
several IBM ThinkPad laptops. We use this deployment 
to measure the delays experienced by applications waking 
a sleeping host, and find these to be surprisingly low: 
2.4s on average, and 4s at maximum, much lower than 
the 30s TCP SYN timeout. These delays include the 
host wake-up delay (~1.4s), and the additional time 
required for the proxy to detect the state change and relay 
the buffered packet causing the wake (~1s). We defer 
a comprehensive deployment-based evaluation to future 
work. 


6 Power-Aware System Redesign 


In this section we consider approaches that might assist 
in reducing idle-time energy consumption by either sim- 
plifying the implementation of proxies or altogether ob- 
viating the need for proxying. 


6.1 Software Redesign 


Our idle traffic analysis shows that solutions relying 
on Wake-on-LAN functionality face the following 
challenges: (i) It is difficult to decide if various packets and 
protocols warrant a machine wake-up. (ii) Hosts receive 
many packets even when idle (3 per second on average). 
(iii) Many protocols exchange packets periodically, 
preventing long quiet periods when hosts could sleep. These 
challenges could be dealt with at both application and 
protocol level: 


Power-aware application configuration Today, appli- 
cations and services are typically designed or configured 
without taking into account their potential impact on the 
power management at end-systems. For example, in Sec- 
tion 4.4 we discussed a tool called Bigfix, that checks if 
network hosts conform to Intel’s corporate security spec- 
ifications. This application was configured to perform 
these checks very aggressively, continuously generating 
large amounts of traffic. Under a WoL approach, this ap- 
plication alone would have made prolonged sleep virtu- 
ally impossible. 

This is a perfect example of the behaviour that could be 
avoided by configuring applications to be more power- 
aware, and perform periodic tasks less frequently, reduc- 
ing the volume of network traffic seen by hosts. 


Protocol Specification The decision to ignore or wake 
on a packet can be difficult, and involves protocol parsing, 
maintaining a long set of filters and rules, and for 
some protocols host- or application-specific state. 

To eliminate the complexity of this decision, and al- 
low hosts to sleep longer even when using very simple 
rules for waking, protocols could be augmented to carry 
explicit power-related information in their packets. An 
example of such information would be a simple bit indi- 
cating whether a packet can be ignored. 


Protocol Redesign We believe these principles should 
be followed when designing power-aware protocols. 


Consideration when using broadcast and multicast: We 
saw earlier that broadcast and multicast are mainly re- 
sponsible for keeping hosts awake. This type of traffic 
could be substantially reduced by redesigning protocols 
to use broadcasts sparingly. Some protocols are partic- 
ularly inefficient in this respect. For example, all Net- 
BIOS datagrams are always sent over Ethernet broadcast 
frames. These frames are received by all hosts on the 
LAN, and then discarded by most of them. This ranks 
NBDGM as one of the top “offenders”, yet this could be 
easily avoided by using unicast transmissions when 
possible. Another approach is based on the observation that 
many service discovery protocols have redundant 
functionality. This redundant functionality could conceivably 
be replaced by a single service that can be shared by a 
multiplicity of applications. 


Synchronization of periodic traffic: One way to 
increase the number of long periods of network quiescence 
would be to identify protocols that use periodic 
updates/message exchanges, and try to synchronize or 
batch these exchanges together. This would allow 
machines to periodically wake up, process all notifications 
and requests, and resume sleep. 
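
The benefit of synchronization can be illustrated by counting the distinct instants at which a host would have to be awake. The following is a minimal sketch under assumptions of ours: each protocol is modeled as a (period, phase) timer in whole seconds, and the three 60-second timers in the example are hypothetical.

```python
def wake_instants(timers, horizon):
    """Distinct instants in [0, horizon) at which some periodic exchange fires.

    timers: iterable of (period, phase) pairs, in whole seconds.
    """
    instants = set()
    for period, phase in timers:
        t = phase
        while t < horizon:
            instants.add(t)
            t += period
    return instants


if __name__ == "__main__":
    unaligned = [(60, 0), (60, 20), (60, 40)]   # three protocols, spread out
    aligned = [(60, 0), (60, 0), (60, 0)]       # same protocols, synchronized
    print(len(wake_instants(unaligned, 600)), len(wake_instants(aligned, 600)))
```

In this toy setting, aligning the phases cuts the number of wake-ups over ten minutes from 30 to 10 without changing how often each protocol exchanges messages.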


Complementing soft state: Many protocols (e.g., SSDP, 
NetBIOS, etc.) maintain and update state using periodic 
broadcast notifications. For such protocols (and 
for similar applications), it would be essential to make 
them disconnection-tolerant, by providing complementary 
state query mechanisms that could be used to quickly 
build up-to-date copies of the soft state upon waking. 
This would enable ignoring any soft state notifications. 
Today, such query mechanisms exist only for some of 
these protocols, and they are often inefficient. 


6.2 Hardware Redesign 


A general goal of energy saving mechanisms, especially 
hardware designs, is to lead the industry towards energy 
proportional computing [8]. If the energy consumption of a 
machine accurately reflected its level of utilization, 
its energy use would be zero when idle. Sleep states are a 
step in this direction; P-states (low power active operation) 
are another. Related to this, it would be very desirable 
to expose power saving states (S states) that feature 
better transition times, even if they offer smaller savings. 
Given the small inter-packet gaps, these states would be 
more useful than the deep-sleep ones. 


7 Related Work 


The notion that internetworked systems waste energy 
due to idle periods has been frequently reiterated [14, 
13, 16, 19, 10, 7, 15]. Network presence proxying for 
the purpose of saving energy in end devices was first 
proposed over ten years ago by Christensen et al.; in 
follow-up work [11] the authors quantify the potential 
savings using traffic traces from a single dormitory access 
point and in [13] examine the traffic received at a 
single idle machine to identify dominant protocols and 
discuss whether these can be safely ignored. Our work 
draws inspiration from this early work, extending it with 
a large-scale and more in-depth evaluation of idle-time 
traffic in enterprise and home environments. A more recent 
proposal [7] postulates the notion of “selective connectivity”, 
whereby a host can dictate or manage its network 
connectivity, going to sleep when it does not want 
to respond to traffic. 

There is an extensive literature on energy saving tech- 
niques for individual PC platforms. Broadly, these aim 
for reduced power draws at the hardware level and faster 
transition times at the system level. These offer a com- 
plementary approach to reducing the power draw of 
idle machines; if and when these techniques lead us to 
perfectly “energy-proportional” computers, the idle-time 
consumption will be less problematic and proxying will 
fade in importance. So far however, achieving such 
energy proportionality has proved challenging. 

In parallel work [6], the authors build a prototype proxy 
supporting BitTorrent and IM as example applications. 
Our work considers a broader proxy design space, 
evaluating the tradeoffs between design options and the 
resultant energy savings informed by detailed analysis 
of network traffic. In relation to our design space, their 
proxy supports BT and IM using application stubs. 


8 Conclusions 


In general, the question of how a proxy should handle 
the user-idle time traffic presents a complex tradeoff 
among the complexity of the proxy, the amount 
of energy saved, and the sophistication of idle-time 
functionality. Through the use of an unusual dataset, collected 
directly on endhosts, we explored the potential savings, 
requirements and effectiveness of technologies that aim 
to put endhost machines to sleep when users are idle. 
For the first time here, we dissect the different categories 
of traffic that are present during idle times, and quan- 
tify which of them have traffic arrival patterns that pre- 
vent periods of deep sleep. We see that broadcast and 
multicast traffic constitute a substantial amount of the 
background chatter due to service discovery and routing 
protocols. Our data also revealed a significant amount of 
outgoing connections, generated in part by enterprise ap- 
plications. We tried to identify which traffic can be ig- 
nored and found that most of the broadcast and multicast 
traffic, as well as roughly 75% of outgoing connections, 
appears safely ignorable. Handling unicast traffic is more 
involved because it is harder to infer the intent of such 
traffic, and often needs some state information to be 
maintained on the proxy. 

Having studied our traffic and the sleep potential 
those patterns contain, we discuss the design space 
for proxies, and evaluate the savings offered by 4 sample 
proxy designs. These cases reveal the tradeoffs between 
design complexity, available functionality and energy 
savings, and we discuss the appropriateness of various 
design points in different use environments, such as 
home and office. 

Finally, we present a general and flexible strawman 
proxy architecture, and we build an extensible Click- 
based proxy that exemplifies one way in which this ar- 
chitecture can be implemented. 
Wishbone: Profile-based Partitioning for Sensornet Applications 

Abstract 


The ability to partition sensor network application code across 
sensor nodes and backend servers is important for running com- 
plex, data-intensive applications on sensor platforms that have 
CPU, energy, and bandwidth limitations. This paper presents 
Wishbone, a system that takes a dataflow graph of operators 
and produces an optimal partitioning. With Wishbone, users 
can run the same program on a range of sensor platforms, in- 
cluding TinyOS motes, smartphones running JavaME, and the 
iPhone. The resulting program partitioning will in general be 
different in each case, reflecting the different node capabili- 
ties. Wishbone uses profiling to determine how each opera- 
tor in the dataflow graph will actually perform on sample data, 
without requiring cumbersome user annotations. Its partition- 
ing algorithm models the problem as an integer linear program 
that minimizes a linear combination of network bandwidth and 
CPU load and uses program structure to solve the problem ef- 
ficiently in practice. Our results on a speech detection applica- 
tion show that the system can quickly identify good trade-offs 
given limitations in CPU and network capacity. 


1 Introduction 


An important class of sensor computing applications 
is data-intensive, involving multiple embedded sensors 
each sampling data at tens or hundreds of kilohertz and 
generating many megabytes per second in aggregate. 
Examples include acoustic localization of animals, gunshots, 
or speakers; structural monitoring and vibration 
analysis of bridges, buildings, and pipes; object tracking 
in video streams, etc. Over the past few years, impressive 
advances in sensor networking hardware and software 
have made it possible to prototype these applications. 
However, two challenges confront the developer 
who wants to deploy and sustain these applications: 


• Heterogeneity: Thanks to hardware advances, one 
can run these applications on a variety of embedded 
devices, including “motes”, smartphones (which 
themselves are varied), embedded Linux devices 
(e.g., Gumstix, WiFi access points), etc. This richness 
of hardware and software is good because it allows 
the developer to pick the right platforms for a 
task and evolve the infrastructure with time. On the 
other hand, it poses a software nightmare because 
it requires code to be developed multiple times, or 
ported to different platforms. 


• Decomposition: A simple way of designing such 
systems would deliver all the gathered data to a central 
server, with all the computation running there. 
This approach may consume an excessive amount of 
bandwidth and energy. A different approach is to 
run all of the computation “in the sensor network”, 
but often the computational capabilities of the sensor 
nodes are insufficient. The question is: how best 
to partition an application between the server(s) and 
the embedded nodes? Improper partitioning can lose 
important data, waste energy, and may cause applications 
to simply not work as desired. 


No current solution addresses both of these challenges. 
To support heterogeneity, one might be able to write pro- 
grams in a language like Java. Unfortunately, some plat- 
forms do not support Java, or may not support it in its full 
generality; in addition, Java virtual machines for embed- 
ded devices are of uneven quality. More importantly, it 
is difficult to partition such a program in a way that will 
perform well on any given platform without a significant 
amount of tuning and manual optimization. That, in turn, 
limits the ability to swap out the underlying hardware 
platform, or even to move computation between the em- 
bedded nodes and servers. 

We have developed Wishbone, a system that allows 
developers to achieve both goals for applications that sat- 
isfy two conditions: 


• Streaming dataflow model: The application should 
be written as a stream-oriented collection of operators 
configured as a dataflow graph. 


• Predictable input rates and patterns: The input data 
rates at the sensors gathering data don’t change in 
unpredictable ways. 


To use Wishbone, the developer writes a program in a 
high-level stream-processing language, WaveScript [16], 
which has a common runtime for both embedded nodes 
and servers. We have extended our open-source Wave- 
Script compiler to produce efficient code for several em- 
bedded platforms: TinyOS 2.0, smartphones running 
Java J2ME, the iPhone, Nokia tablets, various WiFi access 
points, and any POSIX-compliant platform supporting 
GCC. These platforms are sufficiently diverse that 
generating high-performance native code from a shared 
high-level language is itself a challenge. Fortunately, we 
have an advantage in WaveScript’s domain-specificity: 
the compiler has additional information that it can use to 
optimize programs for specific streaming workloads. 

We have used WaveScript in several applications, in- 
cluding: locating wild animals with microphone arrays, 
locating leaks in water pipelines, and detecting potholes 
in sensor-equipped taxis. For the purposes of this paper, 
we chose to focus on two applications that highlight the 
program partitioning features of Wishbone: a speech detector 
that identifies when a person is speaking in a room 
and a 22-channel EEG application. Each is based on an 
application currently in use by our group (EEG) or by 
other groups (speaker detection). Both were ported¹ to 
WaveScript for the evaluation in this paper. 

The key function of Wishbone is, given a WaveScript- 
produced dataflow graph of stream operators, to parti- 
tion it into in-network and server-side components. It 
uses a profile-driven approach, where the compiler exe- 
cutes each operator against programmer-supplied sample 
data, using real embedded hardware or a cycle-accurate 
simulation. After profiling, we are able to estimate the 
CPU and communication requirements of every opera- 
tor on every platform. Wishbone depends on this sample 
data being representative of the actual input the sensor 
will see during deployment; we believe this is a valid as- 
sumption and justify it in our experiments. 

Determining a good partitioning is difficult even af- 
ter one uses a profiler to determine the computational 
and network load imposed by each operator. Wishbone 
models the partitioning problem as an integer linear pro- 
gram (ILP), seeking to minimize a combination of net- 
work bandwidth and CPU consumption subject to hard 
upper bounds on those resources. With these criteria, 
our ILP formulation will find optimal solutions—and al- 
though ILP is an NP-hard problem, in practice our imple- 
mentation can partition dataflow graphs containing over 
a thousand operators in a few seconds. 

Our results show that the system can quickly identify 
the optimal partition given constraints on CPU and net- 
work capacity. And picking the right partition matters. In 
our evaluation, our weakest platform got 0% of speaker 
detection results through the network successfully when 
doing all work on the server, and 0.5% when doing all 
work at the node. We can do 20x better by picking the 
right intermediate partition. Because the optimal parti- 
tioning changes depending on the hardware platform and 
the number of nodes in the network, manual partitioning 
is likely to be tedious at best. For larger graphs (such 
as our 1412 node electroencephalography (EEG) appli- 
cation), doing the partitioning by hand with any degree 
of confidence becomes extremely difficult. 

Finally, we note that we do not intend that Wishbone 
be used only as a completely automated partitioning tool, 
but also as a part of an interactive design process with the 
programmer in the loop. In addition to recommending 
partitions, Wishbone can find situations in which there 
is no feasible partitioning of a program; e.g., because 


¹WaveScript is an imperative language with a C-like syntax. An 
initial port of an application from C/C++ is very quick: cut, paste, and 
clean it up. Refactoring to expose the parallel/streaming structure of 
the application may be more involved. 

fun FIRFilter(coeffs, strm) {
  N = Array:length(coeffs);
  fifo = FIFO:make(N);
  for i = 1 to N-1 { FIFO:enqueue(fifo, 0) };
  iterate x in strm {
    FIFO:enqueue(fifo, x);
    sum = 0;
    for i = 0 to N-1 {
      sum += coeffs[i] * FIFO:peek(fifo, i);
    }
    FIFO:dequeue(fifo);
    emit sum;
  }
}

fun LowFreqFilter(strm) {
  evenSignal = GetEven(strm);
  oddSignal = GetOdd(strm);
  // even samples go to one filter, odds the other:
  lowFreqEven = FIRFilter(hLow_Even, evenSignal);
  lowFreqOdd = FIRFilter(hLow_Odd, oddSignal);
  // now recombine them
  AddOddAndEven(lowFreqEven, lowFreqOdd)
}

fun GetChannelFeatures(strm) {
  low1 = LowFreqFilter(strm);
  low2 = LowFreqFilter(low1);
  low3 = LowFreqFilter(low2);
  high4 = HighFreqFilter(low3); // we need this
  low4 = LowFreqFilter(low3);
  level4 = MagWithScale(filterGains[3], high4);
  high5 = HighFreqFilter(low4); // and this one
  low5 = LowFreqFilter(low4);
  level5 = MagWithScale(filterGains[4], high5);
  high6 = HighFreqFilter(low5); // and this one
  level6 = MagWithScale(filterGains[5], high6);
  zipN([level4, level5, level6]);
}

Figure 1: Excerpts from running code in the EEG application. The 
“low level” FIRFilter function constructs new dataflow opera- 
tors using iterate. FIRFilter is stateful because it maintains 
and modifies fifo. Higher level functions such as LowFreq- 
Filter and GetChannelFeatures wire together a larger graph. 


the bandwidth requirements will always exceed avail- 
able network bandwidth, or because there are insufficient 
CPU resources to place bandwidth-reducing portions of 
the program inside the sensor network. In these cases, 
the programmer will have to either switch to a more pow- 
erful node platform, reduce the sampling rates or the 
number of sensors, or be willing to run the network in 
an overload situation where some samples are lost. In 
the overload case, Wishbone can compute how much the 
data rates need to be reduced to achieve a viable partition. 


2 Language and front-end compiler 


The developer writes a program in WaveScript that con- 
structs a dataflow graph of stream operators. Each op- 
erator consists of a work function and optional private 
state. The job of the WaveScript front-end compiler is 
to partially evaluate the program to create the dataflow 
graph, whereas the WaveScript backend performs graph 
optimizations and reduces work functions to an interme- 
diate language that can be fed to a number of backend 
code generators. Each work function contains an im- 
perative routine that processes a single stream element, 
updates the private state for that dataflow operator, and 
produces elements on output streams. (Later, we will 
single out stateless operators that maintain no mutable 
state between invocations.) 

A WaveScript source program can manipulate streams 
as values and thereby wire together operator graphs, as 
seen in Figure 1. The example in Figure 1 contains pseu- 
docode that wires together the cascading filters found in 
one of the 22 channels of our EEG application. The eval- 
uation of the iterate form creates a new dataflow opera- 
tor and provides its work function. The return value of an 
iterate is its output stream. For example, the function 
FIRFilter in Figure 1 takes a stream as one of its inputs 
and returns a stream. Within the body of the iterate the 
emit keyword produces elements on the output stream. 
The equal (=) operator introduces new variables and the 
last expression in a {...} block is its return value. Type 
annotations are unnecessary. 
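
As a rough illustration of what the iterate form in Figure 1 computes, the following sketch mimics FIRFilter with a Python generator. It is an analogy rather than WaveScript: the name fir_filter and the use of collections.deque for the FIFO are assumptions made for illustration. The generator's local variable fifo plays the role of the operator's private state, and yield plays the role of emit.

```python
from collections import deque

def fir_filter(coeffs, strm):
    """Generator analogue of FIRFilter: one output per input element."""
    n = len(coeffs)
    fifo = deque([0] * (n - 1))            # private state, pre-filled with N-1 zeros
    for x in strm:                         # plays the role of `iterate x in strm`
        fifo.append(x)                     # FIFO:enqueue(fifo, x)
        total = 0
        for i in range(n):
            total += coeffs[i] * fifo[i]   # FIFO:peek(fifo, i)
        fifo.popleft()                     # FIFO:dequeue(fifo)
        yield total                        # plays the role of `emit sum`

out = list(fir_filter([1, 2], [10, 20, 30]))
```

Because the generator returns a new stream, calls can be nested in the same way that LowFreqFilter and GetChannelFeatures compose streams in Figure 1.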


2.1 Program Distribution 


Thus far, our description applies to WaveScript programs 
that run on a single node. To support distributed execu- 
tion, we extended the language to allow developers to 
specify which part of the dataflow graph should be repli- 
cated on all embedded nodes. This specification is log- 
ical rather than physical; the physical locations of oper- 
ators are computed by Wishbone’s partitioner using the 
programmer’s annotations and profiler data. 

To create the logical specification in Wishbone, the 
user places a subset of the program’s top-level stream 
bindings in a Node{} namespace. All operators in the 
Node{} namespace are replicated once per embedded 
node. This separation is particularly important for state- 
ful operators, because stateful operators in the Node par- 
tition have an instance of their state for every node in the 
network. Stateful operators on the server side are instan- 
tiated only once. 

As an example, consider the code snippet in Fig- 
ure 2, which shows a node/server program that samples 
data from the microphone and filters it. The operator 
readMic, producing the stream s1, must reside on each 
node, as it samples data from hardware only available on 
the embedded node. Because the filtAudio call produc- 
ing s2 is in the Node partition, its operators will be repli- 
cated once per node, but can be physically placed either 
on the embedded node or the server, depending on what 
the partitioner determines would be best. If filtAudio 
creates stateful operators, their state will need to be repli- 
cated once per node, regardless of where they are placed. 
This example illustrates the basic repartitioning model, 
and shows that, while the system is free to move some 
operators, there are certain relocation constraints the par- 
titioner must respect, discussed in the next section. 





[Figure 2 diagram labels: embedded node partition; unpinned nodes, moveable by partitioner; radio msgs; implicit merge point; server partition; streams s2, s3, main.]

namespace Node {
  s1 = readMic(...)
  s2 = filtAudio(s1)
}
s3 = f(s2)
main = s3


Figure 2: A program skeleton specifying a replicated stream 
computation across all embedded nodes. 


2.1.1 Relocation Constraints 


Operators are classified as movable or pinned as fol- 
lows. First, operators with side-effects—for example, 
OS-specific foreign calls to sample sensors and blink 
LEDs—are pinned to their partition. Likewise, operators 
on the server that print output to the user or to a file are 
pinned. Stateless operators without side-effects are not 
pinned and are always moveable, allowing them to be 
moved into the other partition if the system determines 
that to be advantageous. Finally, stateful operators are 
treated differently for the node and server partitions. It is 
not generally possible to move stateful server operators 
into the network—they have a serial execution seman- 
tics and a single state instance. However, it is possible 
to move stateful operators from the node partition to the 
server. The state of the operator is duplicated in a table 
indexed by node ID. Thus, a single server operator can 
emulate many instances running within the network. 
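
The table-of-states idea can be sketched as follows. This is a minimal illustration, not Wishbone's generated code: the classes RunningSum and ServerEmulated and their method names are hypothetical.

```python
class RunningSum:
    """A toy stateful operator: emits the running sum of its inputs."""
    def __init__(self):
        self.total = 0

    def work(self, x):
        self.total += x
        return self.total

class ServerEmulated:
    """Server-side stand-in for a stateful node-partition operator.

    Keeps one private state instance per node ID, so a single server
    operator behaves like many independent in-network instances.
    """
    def __init__(self, make_operator):
        self.make_operator = make_operator
        self.state_by_node = {}            # table indexed by node ID

    def work(self, node_id, x):
        if node_id not in self.state_by_node:
            self.state_by_node[node_id] = self.make_operator()
        return self.state_by_node[node_id].work(x)

emu = ServerEmulated(RunningSum)
results = [emu.work("A", 1), emu.work("B", 10), emu.work("A", 2), emu.work("B", 5)]
```

Elements from nodes A and B interleave at the server, yet each node's running sum evolves independently.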


Relocating stateful operators in this way raises a dif- 
ferent issue—message loss on wireless links. Operators 
in the node partition may safely assume that all edges be- 
tween the raw sensors and themselves are lossless. Re- 
locating an operator to the server means putting poten- 
tial data loss upstream of it that was not there previously. 
Stateless operators are insensitive to this kind of loss be- 
cause they process each element without any memory of 
preceding elements, but stateful operators may perform 
erratically in the face of unexpected missing data, unless 
they have been intentionally engineered to tolerate it. 


Because tolerance to data loss in stateful operators is 
an application-specific issue, Wishbone supports two op- 
erational modes that can be specified by the programmer 
at compile time. In conservative mode it will not relocate 
stateful operators onto the server, refusing to add lossi- 
ness to a previously lossless edge. In permissive mode, 
the system will automatically perform these relocations. 
In the future, it would be possible to extend the system to 
make many finer distinctions, such as labeling individual 
edges as loss-tolerant, or grouping operators together in 
blocks that cannot be divided by a lossy edge. 
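
The classification rules of this section can be summarized in a short sketch. The Op record, its field names, and the is_movable function are hypothetical; only the rules themselves come from the text above.

```python
from dataclasses import dataclass

@dataclass
class Op:
    name: str
    partition: str              # logical placement: "node" or "server"
    stateful: bool = False
    side_effects: bool = False

def is_movable(op, mode="conservative"):
    """Return True if the partitioner may relocate `op`."""
    if op.side_effects:
        return False            # sensor reads, LEDs, printing: pinned
    if not op.stateful:
        return True             # stateless, side-effect-free: always movable
    if op.partition == "server":
        return False            # single state instance, serial semantics
    # Stateful node operator: moving it places a lossy link upstream of it,
    # which is allowed only in permissive mode.
    return mode == "permissive"

fir = Op("fir", "node", stateful=True)
```

For example, is_movable(fir) is False in the default conservative mode, while is_movable(fir, "permissive") is True.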
2.1.2 Restrictions 


The system we present in this paper targets a restricted 
domain: first, because we focus on a specific dataflow 
model and, second, because of limitations of our current 
implementation. (Section 9 will discuss generalizing and 
extending the model.) Presently, our implementation re- 
quires that any path through the operator graph connect- 
ing a data source on the node to a data sink on the server 
may only cross the network once. The graph partitioning 
algorithm in Section 4 does, however, support back-and- 
forth communication. The reason for the restriction is 
that we haven’t yet implemented arbitrary communica- 
tion for all of our platforms. Note that this does not rule 
out all communication from the server to the nodes; it is 
still possible, for example, to have configuration param- 
eters sent from a server to in-network operators. 

We make the best of this restriction by leveraging it 
in a number of ways. As we will see, it enables a sim- 
plified version of the partitioning algorithm. It can also 
further filter the set of moveable operators as described 
in Section 2.1.1, because pinning an operator pins all up- 
or down-stream operators (can’t cross back). 


3 Profile & Partition 


The WaveScript compiler, implemented in the Scheme 
language, can profile stream graphs by executing them 
directly within Scheme during compilation (using sam- 
ple input traces). This produces platform-independent 
data rates, but cannot determine execution time on em- 
bedded platforms. For this purpose, we employ a sep- 
arate profiling phase on the device itself, or on a cycle- 
accurate simulator for its microprocessor. 

First, the partitioner determines what operators might 
possibly run on the embedded platform, discounting 
those that are pinned to the server, but including movable 
operators together with those that are pinned to the node. 
The code generator emits code for this partition, insert- 
ing timing statements at the beginning and end of each 
operator’s work function, and at emit statements, which 
represent yield points or control transfers downstream. 

The partition is then executed on simulated or real 
hardware. The inserted timing statements print output to 
a debug channel read by the compiler. For example, we 
execute instrumented TinyOS programs either on TMote 
Sky motes or by using the MSPsim simulator². In either 
case, timestamps are sent through a real or virtual USB 
serial port, where they are collected by the compiler. 

For most platforms, the above timestamping method is 
sufficient. That is, the only relevant information for parti- 
tioning is how long each operator takes to execute on that 


²We also tried Simics and msp430-gdb for simulation, but MSPsim 
was the easiest to use. Note that TOSSIM is not appropriate for 
performance modeling. 
platform (and therefore, given an input data rate, the per- 
cent CPU consumed by the operator). For TinyOS, some 
additional profiling is necessary. To support subdividing 
tasks into smaller pieces, we must be able to perform a 
reverse mapping between points in time (during an oper- 
ator’s execution) and points in the operator’s code. Ide- 
ally, for operator splitting purposes, we would recover a 
full execution trace, annotating each atomic instruction 
with a clock cycle. Such information, however, would 
be prohibitively expensive to collect. We have found it is 
sufficient to instead simply time stamp the beginning and 
end of each for or while loop, and count loop iterations. 
As most time is spent within loops, and loops generally 
perform identical computations repeatedly, this enables 
us to roughly subdivide execution of an operator into a 
specified number of slices. 
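
One way such loop-level timing could be turned into approximate split points is sketched below. The profile format (label, duration, iteration count) and the function split_points are assumptions for illustration; the sketch assumes time within a loop is spread evenly over its iterations and that straight-line code cannot be split.

```python
import math

def split_points(segments, num_slices):
    """Pick approximate split points for dividing an operator into slices.

    `segments` lists (label, duration, iterations) in execution order;
    straight-line code has iterations == 1. Returns (label, iteration)
    pairs after which a yield point could be inserted.
    """
    total = sum(duration for _, duration, _ in segments)
    targets = [total * k / num_slices for k in range(1, num_slices)]
    points, elapsed = [], 0.0
    for label, duration, iterations in segments:
        if iterations > 1:                      # only loops can be split
            per_iteration = duration / iterations
            for t in targets:
                if elapsed < t <= elapsed + duration:
                    # first iteration boundary at or after the target time
                    points.append((label, math.ceil((t - elapsed) / per_iteration)))
        elapsed += duration
    return points

profile = [("setup", 10, 1), ("loop1", 80, 8), ("teardown", 10, 1)]
slices = split_points(profile, 4)
```

With this hypothetical profile, the three split points all fall inside loop1, after iterations 2, 4, and 7. A target time that lands in straight-line code is skipped, so the result is only a rough subdivision.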

After profiling, control transfers to the partitioner. The 
movable subgraph of operators has already been deter- 
mined. Next, the partitioner formulates the partitioning 
problem in terms of this subgraph, and invokes an exter- 
nal solver (described in Section 4) to identify the optimal 
partition. The program graph is repartitioned along the 
new boundary, and code generation proceeds, including 
generating communication code for cut edges (e.g., code 
to marshal and unmarshal data structures). Also, after 
profiling and partitioning, the compiler generates a visu- 
alization summarizing the results for the user. The visu- 
alization, produced using the well-known GraphViz tool 
from AT&T Research, uses colorization to represent pro- 
filing results (cool to hot) and shapes to indicate which 
operators were assigned to the node partition. 


4 Partitioning Algorithms 


In this section, we describe Wishbone’s algorithms to 
partition the dataflow graph. We consider a directed 
acyclic graph (DAG) whose vertices are stream operators 
and whose edges are streams, with edge weights repre- 
senting bandwidth and vertex weights representing CPU 
utilization or memory footprint. We only include vertices 
that can move across the node-server partition; i.e., the 
movable subset. The server is assumed to have infinite 
computational power compared to the embedded nodes, 
which is a close approximation of reality. 

The partitioning problem is to find a cut of the graph 
such that vertices on one side of the cut reside on the 
nodes and vertices on the other side reside on the server. 
The bandwidth of a given cut is measured as the sum 
of the bandwidths of the edges in the cut. An example 
problem is shown in Figure 3. 
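
The two quantities that characterize a cut can be computed directly from the weighted graph. The sketch below uses a hypothetical three-edge pipeline (not the graph of Figure 3); the function names and the dictionary representation are assumptions.

```python
def cut_bandwidth(edges, on_node):
    """Sum the bandwidth of edges whose endpoints lie on different sides."""
    return sum(bw for (u, v), bw in edges.items()
               if (u in on_node) != (v in on_node))

def node_cpu(cpu_cost, on_node):
    """Total CPU consumed by the operators placed on the embedded node."""
    return sum(cpu_cost[v] for v in on_node)

edges = {("src", "a"): 8, ("a", "b"): 4, ("b", "sink"): 1}
cpu_cost = {"src": 1, "a": 2, "b": 3}
```

Moving the cut further downstream lowers the bandwidth (8, then 4, then 1) while raising the node's CPU load (1, then 3, then 6), which is the trade-off the partitioner must resolve.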

Unfortunately, existing tools for graph partitioning are 
not a good fit for this problem. Tools like METIS [12] 
or Zoltan [7] are designed for partitioning large scien- 
tific codes for parallel simulation. These are heuristic 
solutions that generally seek to create a fixed number of 

[Figure 3 panel labels: budget = 2, budget = 3, budget = 4; bandwidth = 6, bandwidth = 5, bandwidth = 8]



Figure 3: Simple motivating example. Vertices are labeled with 
CPU consumed, edges with bandwidth. The optimal mote par- 
tition is selected in red. This partitioning can change unpre- 
dictably, for example between a horizontal and vertical parti- 
tioning, with only a small change in the CPU budget. 


balanced graph partitions while minimizing cut edges. 
Newer tools like Zoltan support unbalanced partitions, 
but only with specified ratios; they do not allow unlimited and 
unspecified capacity for the server partition. Further, they 
expect a single weight on each edge and each vertex. 
They cannot support a situation where the cost of a ver- 
tex changes depending on the partition it is placed in. 
This is the situation we’re faced with: diverse hardware 
platforms that not only have varying capacities, but for 
which the relative cost of operators varies (for example, 
due to a missing floating point unit). 

We may also consider traditional task scheduling algo- 
rithms as a candidate solution to our partitioning prob- 
lem. These algorithms assign a directed graph of tasks 
to processors, attempting to minimize the total execution 
time. The most popular heuristics for this class of prob- 
lem are variants of list scheduling, where tasks are prior- 
itized according to some metric and then added one at a 
time to the working schedule. But there are three major 
differences between this classic problem and our own. 
First, task-scheduling does not directly fit the nondeter- 
ministic dataflow model, as no conditional control flow is 
allowed at the task level—all tasks execute exactly once. 
Second, task-scheduling is not designed for vastly un- 
equal node capabilities. Finally, schedule length is not 
the appropriate metric for streaming systems. Schedule 
length would optimize for latency: how fast can the sys- 
tem process one data element. Rather, we wish to op- 
timize for throughput, which is akin to scheduling for a 
task-graph repeated ad infinitum. 

Thus we have developed a different approach. Our 
technique first preprocesses the graph to reduce the parti- 
tion search space. Then it constructs a problem formula- 
tion based on the desired objective function and calls an 
external ILP solver. By default, Wishbone currently uses 
the minimum-cost cut subject to not exceeding the CPU 
resources of the embedded node or the network capacity 


of the channel. Cost here is defined as a linear combina- 
tion of CPU and network usage, $\alpha \cdot CPU + \beta \cdot Net$ (which 
can be a proxy for energy usage). Therefore we set four 
numbers for each platform: the CPU/Network resource 
limits, and coefficients $\alpha$, $\beta$. The user may override these 
quantities to direct the optimization process. 


4.1 Preprocessing 


The graph preprocessing step precedes the actual parti- 
tioning step. The goal of the preprocessing step is to 
eliminate edges that could never be viable cut-points. 
Consider an operator u that feeds another operator v such 
that the bandwidth from v is the same or higher than the 
bandwidth on the output stream from u. A partition with 
a cut-point on v’s output stream can always be im- 
proved by moving the cut-point to the stream u → v; the 
bandwidth does not increase, but the load on the embed- 
ded node decreases (v moves to the server). Thus, any 
operator that is data-expanding or data-neutral may be 
merged with its downstream operator(s) for the purposes 
of the partitioning algorithm, reducing the search space 
without eliminating optimal solutions. 
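
For a linear pipeline, the preprocessing rule reduces to a single pass. The sketch below is a simplification under that assumption (general DAGs with fan-in or fan-out are not handled); the function name and the list-of-pairs input format are hypothetical.

```python
def viable_cut_edges(pipeline):
    """pipeline: list of (name, output_bandwidth), source first.

    Returns the names of operators whose output edge remains a candidate
    cut point. An operator whose output bandwidth is >= its input
    bandwidth is merged with its downstream neighbour, so its output
    edge is dropped from consideration.
    """
    viable = [pipeline[0][0]]          # the source's output edge is always kept
    for (_, in_bw), (name, out_bw) in zip(pipeline, pipeline[1:]):
        if out_bw < in_bw:             # data-reducing: a cut after it can help
            viable.append(name)
    return viable

candidates = viable_cut_edges([("src", 8), ("a", 8), ("b", 4), ("c", 6)])
```

Here a is data-neutral and c is data-expanding, so only the edges leaving src and b survive as cut candidates.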


4.2 Optimal Partitionings 


It is well-known that optimal graph partitioning is NP- 
complete [8]. Despite the intrinsic difficulty of the prob- 
lem, the problem proves tractable for the graphs seen in 
realistic applications. Our pre-processing heuristic re- 
duces the problem size enough to allow an ILP solver to 
solve it exactly within a few seconds to minutes. 


4.2.1 Integer Linear Programming (ILP) 


Let $G = (V, E)$ be the directed acyclic graph (DAG) of 
stream operators. For all $v \in V$, the compute cost on the 
node is given by $c_v > 0$ and the communication (radio) 
cost is given by $r_{uv}$ for all edges $(u,v) \in E$. One might 
think of the compute cost in units of MHz (megahertz 
of CPU required to process a sample and keep up with 
the sampling rate), and the bandwidth cost in kilobits/s 
consumed by the data going over the radio. Adding ad- 
ditional constraints for RAM usage (assuming static allo- 
cation) or code storage is straightforward in this formu- 
lation, but we do not do it here. For each of these costs 
we can use either mean or peak load (profiling computes 
both). Because our applications have predictable rates, 
we use mean load here. Peak loads might be more appro- 
priate in applications characterized by “bursty” rates. 
The DAG $G$ contains a set of terminal source vertices 
$S$ and sink vertices $T$, which have no inward and 
outward edges, respectively, and where $S, T \subset V$. As 
noted above, we construct G from the original operator 
graph such that these boundary vertices are pinned—all 
the sources must remain on the embedded node; all sinks 
on the server. Recall that the partitioning problem is to 
find a single cut of G that assigns vertices to the nodes 
and server. We can think of the graph G as corresponding 
to the server and a single node, but vertices assigned to 
the node partition are instantiated on all physical nodes 
in the system. 

We encode a partitioning using a set of indicator vari- 
ables $f_v \in \{0,1\}$ for all $v \in V$. If $f_v = 1$, then operator 
$v$ resides on the node; otherwise, it resides on the server. 
The pinning constraints are: 

\[
\begin{aligned}
(\forall u \in S)\quad & f_u = 1 \\
(\forall u \in T)\quad & f_u = 0 \\
(\forall u)\quad & f_u \in \{0,1\}
\end{aligned}
\tag{1}
\]

Next, we constrain the sum of node CPU costs to be 
at most some total budget $C$: 

\[
cpu \leq C \quad \text{where} \quad cpu = \sum_{v \in V} f_v c_v
\tag{2}
\]

A simple expression for the total cut bandwidth is 
$\sum_{(u,v) \in E} (f_u - f_v)^2 r_{uv}$. (Because $f_v \in \{0,1\}$, the 
square evaluates to 1 when the edge $(u,v)$ is cut and to 0 
if it is not; $|f_u - f_v|$ gives the same values.) However, we 
prefer to formulate the integer programming problem as 
one with a linear rather than quadratic objective function, 
so that standard ILP techniques can be used. 

We can convert the quadratic objective function to a 
linear one by introducing two variables per edge, $e_{uv}$ and 
$e'_{uv}$, which are subject to the following constraints: 

\[
\begin{aligned}
(\forall (u,v) \in E)\quad & e_{uv} \geq 0 \\
(\forall (u,v) \in E)\quad & e'_{uv} \geq 0 \\
(\forall (u,v) \in E)\quad & f_u - f_v + e_{uv} \geq 0 \\
(\forall (u,v) \in E)\quad & f_v - f_u + e'_{uv} \geq 0
\end{aligned}
\tag{3}
\]



The intuition here is that when the edge $(u,v)$ is not 
cut (i.e., $u$ and $v$ are in the same partition), we would 
like $e_{uv}$ and $e'_{uv}$ to both be zero. When $u$ and $v$ are in 
different partitions, we would like a non-zero cost to be 
associated with that edge; the constraints above ensure 
that the cost is at least 1 unit, because $f_u - f_v$ is $-1$ when 
$u$ is on the server and $v$ on the embedded node. These 
observations allow us to formulate the bandwidth of the 
cut, cap that bandwidth, and define the objective function 
in terms of both CPU and network load. 

\[
net \leq N \quad \text{where} \quad net = \sum_{(u,v) \in E} (e_{uv} + e'_{uv})\, r_{uv}
\tag{4}
\]

\[
\text{objective:} \quad \min(\alpha \cdot cpu + \beta \cdot net)
\tag{5}
\]

Any optimal solution of (5) subject to (1), (2), (3), and 
(4) will have $e_{uv} + e'_{uv}$ equal to 1 if the edge is cut and to 
0 otherwise. Thus, we have shown how to express our 
partitioning problem as an integer programming prob- 
lem with a linear objective function, $2|E| + |V|$ variables 
(only $|V|$ of which are explicitly constrained to be inte- 
gers), and at most $4|E| + |V| + 1$ equality or inequality 
constraints. 

We could use a standard ILP solver on the formulation 
described above, but a further improvement is possible 
if we restrict the data flow to not cross back and forth 
between node and server, as described in Section 2.1.2. 
On the positive side, the restriction reduces the size of 
the partitioning problem, which speeds up its solution. 

With the above restriction, we can then flip all edges 
going from server to node for the purpose of partitioning 
(the communication cost would be the same under our 
model). With all edges pointed towards the server, and 
only one crossing of the network allowed, another set of 
constraints now applies: 

\[
(\forall (u,v) \in E)\quad f_u - f_v \geq 0
\tag{6}
\]

With (6), the network load quantity simplifies: 

\[
net = \sum_{(u,v) \in E} (f_u - f_v)\, r_{uv}
\tag{7}
\]

This formulation eliminates the $e_{uv}$ and $e'_{uv}$ variables, 
simplifying the optimization problem. We now have 
only $|V|$ variables and at most $|E| + |V| + 1$ con- 
straints. We have chosen this restricted formulation 
for our current prototype implementation, primarily be- 
cause the per-platform code generators don’t yet support 
arbitrary back-and-forth communication between node 
and server. We use an off-the-shelf integer programming 
solver, lp_solve³, to minimize (7) subject to (1) and (2). 
We note that the restriction of unidirectional data flow 
does preclude cases when sinks are pinned to embed- 
ded nodes (e.g., actuators or feedback in the signal pro- 
cessing). It also prevents a good partition when a high- 
bandwidth stream is merged with a heavily-processed 
stream. In the latter case, the merging must be done 
on the node due to the high-bandwidth stream, but the 
expensive processing of the other stream should be per- 
formed on the server. In our applications so far, we have 
found our restriction to be a good compromise between 
provable optimality and speed of finding a partition. 
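
To make the restricted formulation concrete, the sketch below solves it by exhaustive enumeration on a tiny hypothetical graph. This is a stand-in for illustration only: it is not lp_solve, it runs in time exponential in the number of operators, and the function name, argument layout, and example weights are all assumptions.

```python
from itertools import product

def best_partition(ops, edges, sources, sinks, cpu_budget, net_budget,
                   alpha=0.0, beta=1.0):
    """Exhaustively solve the restricted partitioning problem on a small DAG.

    ops:   {name: CPU cost if placed on the node}
    edges: {(u, v): bandwidth}, all pointing towards the server
    Returns (cost, set of operators on the node), or None if infeasible.
    """
    names = list(ops)
    best = None
    for bits in product((0, 1), repeat=len(names)):
        f = dict(zip(names, bits))
        if any(f[s] != 1 for s in sources) or any(f[t] != 0 for t in sinks):
            continue                                   # pinning constraints (1)
        if any(f[u] - f[v] < 0 for u, v in edges):
            continue                                   # single crossing (6)
        cpu = sum(f[v] * ops[v] for v in names)
        net = sum((f[u] - f[v]) * r for (u, v), r in edges.items())   # (7)
        if cpu > cpu_budget or net > net_budget:
            continue                                   # resource caps
        cost = alpha * cpu + beta * net
        if best is None or cost < best[0]:
            best = (cost, {v for v in names if f[v]})
    return best

ops = {"src": 1, "a": 2, "b": 3, "sink": 1}
edges = {("src", "a"): 8, ("a", "b"): 4, ("b", "sink"): 1}
tight = best_partition(ops, edges, ["src"], ["sink"], cpu_budget=3, net_budget=10)
loose = best_partition(ops, edges, ["src"], ["sink"], cpu_budget=6, net_budget=10)
```

With a CPU budget of 3 the cheapest feasible cut keeps src and a on the node (bandwidth 4); raising the budget to 6 lets b move onto the node as well (bandwidth 1).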


4.3 Data Rate as a Free Variable 


It is possible that the partitioning algorithm will not be 
able to find a cut that satisfies all of the constraints (i.e., 
there may be no way to “fit” the program on the embed- 
ded nodes). In this situation we wish to find the maxi- 
mum data rates for input sources that will support a vi- 
able partitioning. The algorithm given above cannot di- 
rectly treat data rate as a free variable. Even if CPU and 


³lp_solve was developed by Michel Berkelaar, Kjell Eikland, 
and Peter Notebaert. It uses branch-and-bound to solve integer- 
constrained problems, like ours, and the Simplex algorithm to solve 
linear programming problems. 
network load varied linearly with data rate, the resulting 
optimization problem would be non-linear. However, it 
turns out to be inexpensive to perform the search over 
data-rates as an outer loop that on each iteration calls the 
partitioning algorithm. 

This is because in most applications, CPU and net- 
work load increase monotonically with input data rate. If 
there is a viable partition when scaling input data rates by 
a factor X, then any factor Y < X will also have a viable 
partitioning. Thus Wishbone simply does a binary search 
over data rates to find the maximum rate at which the par- 
titioning algorithm returns a valid partition. As long as 
we are not over-saturating the network such that sending 
fewer packets actually results in more data being success- 
fully received, this maximum sustainable rate will be the 
best rate to pick to maximize outputs (throughput) of the 
data flow graph. We will re-examine this assumption in 
Section 7. 
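
The outer search loop can be sketched as a standard bisection. The function below is an illustration under the monotonicity assumption stated above; is_feasible is a hypothetical callback standing in for one invocation of the partitioner at a given rate scale, and the tolerance parameter is an assumption.

```python
def max_sustainable_scale(is_feasible, lo=0.0, hi=1.0, tol=1e-3):
    """Binary-search the largest rate scale factor in [lo, hi] that is feasible.

    Relies on monotonicity: if scale X is feasible, so is every Y < X.
    Returns None if even the lowest scale has no viable partition.
    """
    if is_feasible(hi):
        return hi                      # the full data rate already fits
    if not is_feasible(lo):
        return None
    while hi - lo > tol:               # invariant: lo feasible, hi infeasible
        mid = (lo + hi) / 2
        if is_feasible(mid):
            lo = mid
        else:
            hi = mid
    return lo

# Toy feasibility test: CPU load is 10 units at full rate, budget is 3 units.
scale = max_sustainable_scale(lambda s: 10.0 * s <= 3.0)
```

Each probe costs one partitioner call, so reaching a tolerance of 1e-3 on the interval [0, 1] takes about ten calls.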


5 Wishbone Platform Backends 


In this section, we describe three new WaveScript 
code generators that we built for Wishbone and that are pre- 
sented here for the first time. These support ANSI C, 
NesC/TinyOS, and JavaME. 


5.1 Code Generation: ANSI C and JavaME 


In contrast with the original WaveScript C++ back- 
end (and XStream runtime engine), our current C 
code-generator produces simple, single threaded code 
in which each operator becomes a function definition. 
Passing data via emit becomes a function call, and the 
system does a depth-first traversal of the stream graph. 
The generated code requires virtually no runtime and is 
easily portable. This C backend is used to execute the 
server-side portion of a partitioned program, as well as 
the node-side portion on Unix-like embedded platforms 
that run C, such as the iPhone (jailbroken), Gumstix, or 
Meraki. 

Generating code for JavaME is also straightforward, as 
Java provides a high level programming environment that 
abstracts hardware management. The basic mapping be- 
tween the languages is the same as in the C backend. Op- 
erators become functions, and an entire graph traversal is 
a chain of function calls. Some minor problems arise due 
to Java’s limited set of numeric types. 
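
The "emit becomes a function call" mapping can be illustrated with a small Python analogue. The operator names below are hypothetical and the real backends generate C or Java rather than Python; the point is only the resulting depth-first execution order.

```python
def make_graph(log):
    # Each operator is a plain function; an emit is a direct call downstream.
    def sink(x):
        log.append(x)

    def square(x):
        sink(x * x)                # emit: call the downstream operator

    def duplicate(x):
        square(x)                  # first emit runs all the way to the sink...
        square(x + 1)              # ...before the second emit is issued

    return duplicate

log = []
source = make_graph(log)
for sample in [1, 3]:
    source(sample)
```

Each input element drives a complete traversal before the next element is read, so no queues are needed between operators.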


5.2 Code Generation: TinyOS 2.0 


Supporting TinyOS 2.0 is much more challenging. The 
difficulties are both due to the extreme resource con- 
straints of TinyOS motes (typically less than 10 KB of 
RAM and 100 KB of ROM), and to the restricted con- 
currency model of TinyOS (tasks must be relatively 
short-lived and non-blocking; all I/O must be performed 
with split-phase asynchronous calls). Also, program ob- 
jects must be serialized and split into small network packets. 


Wishbone’s support for TinyOS demonstrates its ability 
to use platforms with severe resource restrictions and un- 
usual concurrency models. 

Our prototype does not currently support WaveScript’s 
dynamic memory management in code running on 
motes. We may support it in the future, but it remains to 
be seen whether this style of programming can be made 
effective for extremely resource constrained devices. In- 
stead, we enforce that all operators assigned to motes use 
only statically allocated storage in our applications. 

The most difficult issue in mapping a high-level lan- 
guage onto TinyOS is handling the TinyOS concurrency 
model. All code executes in either task or interrupt con- 
text, with only a single, non-preemptive task running at a 
time. Wishbone simply maps each operator onto a task. 
Each data element that arrives on a source operator, for 
example a sensor sample or an array of samples, will re- 
sult in a depth-first traversal of the operator graph (exe- 
cuted as a series of posted tasks). This graph traversal 
is not re-entrant. Instead, the runtime buffers data at the 
source operators until the current graph traversal finishes. 

This simple design raises several issues. First, gen- 
erated TinyOS tasks must be neither too short nor too 
long. Tasks with very short durations incur unneces- 
sary overhead, and tasks that run too long degrade sys- 
tem performance by starving important system tasks (for 
example, sending network messages). Second, the best 
method for transferring data items between operators is 
no longer obvious. In the basic C backend, we simply 
issue a function call to the downstream operator, wait for 
it to complete, and then continue computation. We can- 
not use this method under TinyOS, where it would force 
us to perform an entire traversal of the graph in a single 
very long task execution. But the obvious alternative also 
presents problems: executing an operator in its entirety 
before any downstream operators would require a queue 
to buffer all output elements of the current operator. 

The full details of TinyOS code generation are beyond 
the scope of this paper. In short, the WaveScript com- 
piler can convert programs into a cooperative 
multi-tasking form (via a CPS conversion). This serves 
two purposes. First, every call to emit can serve as a yield 
point, causing the task to yield to its downstream oper- 
ator in a depth-first fashion (with no queues), which in 
turn will re-post the upstream operator upon completing 
the traversal. Second, based on profiling data, additional 
yield points can be inserted to “split” tasks to adjust gran- 
ularity for system health. 
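
The task model described above can be illustrated with a small simulation. This is only a sketch in Python, not the nesC code that the WaveScript compiler generates: the names run_traversal and run_source, the two toy operators, and the use of Python generators in place of a CPS conversion are all assumptions made for illustration. Each yield stands for a call to emit, each invocation of step stands for one short non-preemptive task, and new source elements wait in a buffer until the current traversal finishes.

```python
from collections import deque


def splitter(window):
    # Toy operator: emits each half of its input; every `yield` is an emit.
    mid = len(window) // 2
    yield window[:mid]
    yield window[mid:]


def summer(chunk):
    # Toy operator: emits a single value per input.
    yield sum(chunk)


def run_traversal(operators, element, log):
    """Push one source element through a linear pipeline, depth first.

    Each call to step() models one posted task. No inter-operator queues
    are used: an emitting operator is suspended on a stack until its
    downstream operator has finished, then it is re-posted.
    """
    tasks = deque()                       # stand-in for the task queue
    results = []
    stack = [(0, operators[0](element))]  # suspended operator invocations

    def step():
        depth, gen = stack[-1]
        try:
            out = next(gen)               # run until the next emit
        except StopIteration:
            stack.pop()                   # done; the upstream resumes next
            log.append(("done", depth))
        else:
            log.append(("emit", depth))
            if depth + 1 < len(operators):
                stack.append((depth + 1, operators[depth + 1](out)))
            else:
                results.append(out)       # output of the last operator
        if stack:
            tasks.append(step)            # post the follow-up task

    tasks.append(step)
    while tasks:
        tasks.popleft()()
    return results


def run_source(operators, arrivals):
    """Buffer arriving elements and start a traversal only when idle."""
    buffer = deque(arrivals)
    outputs, log = [], []
    while buffer:
        outputs.extend(run_traversal(operators, buffer.popleft(), log))
    return outputs, log
```

For example, run_source([splitter, summer], [[1, 2, 3, 4]]) performs seven short tasks for the single input window, alternating between the two operators rather than running either one to completion.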


6 Applications 


We evaluate Wishbone in terms of two experimental applications:
acoustic speech detection and EEG-based
seizure onset detection. Both of these applications exercise
Wishbone’s capability to automatically partition a
single high-level program into components that run over
a network containing sensor nodes and a server or “base
station”. Neither of these applications is in itself novel.
In both cases we ported existing implementations from
Matlab and C to Wishbone and verified that the results
matched the original implementations.


Figure 4: Custom audio board attached to a TMote Sky.


6.1 Application: Seizure Onset Detection 


We used Wishbone to implement a patient-specific
seizure onset detection algorithm [20]. The application
was previously implemented in C++, but by porting it
to Wishbone/WaveScript we enabled its embedded/distributed
operation, while reducing the amount of code by
a factor of four without loss of performance.

The algorithm is designed to be used in a system for 
detecting seizures outside a clinical environment. In this 
application, a user would wear a monitoring cap that typ- 
ically consists of 16 to 22 channels. Data from the cap is 
processed by a low-power portable device. 

The algorithm we employ [21] samples data from 22
channels at 256 samples per second. Each sample is 16
bits wide. For each channel, we divide the stream into
2-second windows. When a seizure occurs, oscillatory
waves below 20 Hz appear in the EEG signal. To extract
these patterns, the algorithm looks for energy in certain
frequency bands.

To extract the energy information, we first filter each
channel by using a polyphase wavelet decomposition.
We use a repeated filtering structure to perform the
decomposition. The filtering structure first extracts the
odd and even portions of the signal, passes each signal
through a 4-tap FIR filter, then adds the two signals
together. Depending on the values of the coefficients in the
filter, we perform either a low-pass or a high-pass filtering
operation. This structure is cascaded through 7 levels,
with the high frequency signals from the last three levels
used to compute the energy in those signals. Note that at
each level, the amount of data is halved.

As a final step, all features from all channels, 66 in
total, are combined into a single vector which is input
into a patient-specific support vector machine (SVM).
The SVM detects whether or not each window contains
epileptiform activity. After three consecutive positive
windows have been detected, a seizure is declared.
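
The per-channel processing described above can be sketched as follows. This is a minimal illustration in Python, not the WaveScript implementation. The filter coefficients are placeholders (Haar-like averaging and differencing, padded to four taps), since the actual coefficients are not given here. Taking the high-pass outputs of levels 5 to 7 while descending the low-pass branch is an assumption about how the cascade is wired. The SVM itself is omitted, so declare_seizure operates on a hypothetical list of per-window labels.

```python
# Placeholder 4-tap coefficient pairs (even branch, odd branch).
LOW = ([0.5, 0.0, 0.0, 0.0], [0.5, 0.0, 0.0, 0.0])
HIGH = ([0.5, 0.0, 0.0, 0.0], [-0.5, 0.0, 0.0, 0.0])


def fir(x, taps, i):
    # Output sample i of a causal FIR filter, with zero history before x[0].
    return sum(t * x[i - k] for k, t in enumerate(taps) if i - k >= 0)


def polyphase_stage(signal, taps_even, taps_odd):
    """Split into even/odd samples, filter each branch, add the results.

    The output has half as many samples as the input.
    """
    even, odd = signal[0::2], signal[1::2]
    n = min(len(even), len(odd))
    return [fir(even, taps_even, i) + fir(odd, taps_odd, i) for i in range(n)]


def channel_features(window, levels=7, keep=3):
    """Energy of the high-pass output at each of the last `keep` levels."""
    features = []
    current = window
    for level in range(1, levels + 1):
        high = polyphase_stage(current, *HIGH)
        current = polyphase_stage(current, *LOW)
        if level > levels - keep:
            features.append(sum(v * v for v in high))
    return features


def feature_vector(channel_windows):
    """Concatenate the features of all channels (22 channels x 3 = 66)."""
    vec = []
    for window in channel_windows:
        vec.extend(channel_features(window))
    return vec


def declare_seizure(window_labels, needed=3):
    """Index of the window that completes `needed` consecutive positives."""
    run = 0
    for i, positive in enumerate(window_labels):
        run = run + 1 if positive else 0
        if run >= needed:
            return i
    return None
```

With 2-second windows at 256 samples per second, each channel window holds 512 samples, and the seven halvings leave 4 samples at the last level.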

There are multiple places where Wishbone can par- 
tition this algorithm. If the entire application fits on the 
embedded node, then the data stream is reduced to only a 
feature vector—an enormous data reduction. But data is 
also reduced by each stage of processing on each chan- 
nel, offering many intermediate points which are prof- 
itable to consider. 


6.2 Acoustic Speech Detection 


We used Wishbone to build a speech detection applica- 
tion that uses sampled audio to detect the presence of a 
person who is speaking near a sensor. The ultimate goal 
of such an application would be to perform speaker iden- 
tification using a distributed network of microphones. 
For example, such a system could potentially be used to 
locate missing children in a museum by their voice, or to 
implement various security applications. 

However, in our current work we are only concerned 
with speech detection, a precursor to the problem of 
speaker identification. In particular, our goal is to reduce 
the volume of data required to achieve speaker identifi- 
cation, by eliminating segments of data that probably do 
not contain speech and by summarizing the speech data 
through feature extraction. 

Our implementation of speech detection and data re- 
duction is based on Mel Frequency Cepstral Coefficients 
(MFCC), following the approach of prior work in the 
area. Recent work by Martin, et al. has shown that clus- 
tering analysis of MFCCs can be used to implement ro- 
bust speech detection [14]. Another article by Saasta- 
moinen, et al. describes an implementation of speaker 
identification on smartphones, based on applying learning
algorithms to MFCC feature sets [19]. Based on this
prior work, we chose to exercise our system using an
implementation of MFCC feature extraction.


6.2.1 Mel Frequency Cepstral Coefficients


Mel Frequency Cepstral Coefficients (MFCC) are the
most commonly used features in speech recognition
algorithms. The MFCC feature stream represents a significant
data reduction relative to the raw data stream.


To compute MFCCs, we first compute the spectrum of
the signal, and then summarize it using a bank of overlapping
filters that approximates the resolution of human
aural perception. By discarding some of the data
that is less relevant to human perception, the output of
the filter bank represents a 4X data reduction relative
to the original raw data. We then convert this
reduced-resolution spectrum from a linear to a log spectrum.
Using a log spectrum makes it easier to separate convolutional
components such as the excitation applied to the
vocal tract and the impulse response of a reverberant
environment, because transforms that are multiplicative in
a linear spectrum are additive in a log spectrum.

Finally, we compute the MFCCs as the first 13 coef- 
ficients of the Discrete Cosine Transform (DCT) of this 
reduced log-spectrum. By analyzing the spectrum of a 
spectrum, the distribution of frequencies can be charac- 
terized at a variety of scales [6, 5]. 
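
The steps above (spectrum, filter bank, log, DCT) can be sketched in plain Python as follows. This is an illustrative sketch rather than the Wishbone implementation. The triangular filter shape and the mel-scale formula are the conventional choices and are assumed here, and the default of 32 filters is an assumption as well. The 8 kHz sample rate and the 13 output coefficients follow the text. A naive DFT is used to keep the example self-contained; a real implementation would use an FFT.

```python
import cmath
import math


def power_spectrum(frame):
    # Naive DFT over the non-negative frequencies.
    n = len(frame)
    spec = []
    for k in range(n // 2 + 1):
        acc = sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n)
                  for t in range(n))
        spec.append(abs(acc) ** 2)
    return spec


def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10 ** (m / 2595.0) - 1.0)


def mel_filterbank(num_filters, num_bins, sample_rate):
    """Overlapping triangular filters spaced evenly on the mel scale."""
    nyquist = sample_rate / 2.0
    top = hz_to_mel(nyquist)
    edges = [mel_to_hz(top * i / (num_filters + 1)) * (num_bins - 1) / nyquist
             for i in range(num_filters + 2)]   # edges in fractional bins
    bank = []
    for j in range(1, num_filters + 1):
        lo, mid, hi = edges[j - 1], edges[j], edges[j + 1]
        weights = []
        for b in range(num_bins):
            if lo < b <= mid:
                weights.append((b - lo) / (mid - lo))
            elif mid < b < hi:
                weights.append((hi - b) / (hi - mid))
            else:
                weights.append(0.0)
        bank.append(weights)
    return bank


def mfcc(frame, sample_rate=8000, num_filters=32, num_coeffs=13):
    spec = power_spectrum(frame)
    bank = mel_filterbank(num_filters, len(spec), sample_rate)
    energies = [sum(w * p for w, p in zip(weights, spec)) for weights in bank]
    # Floor the energies so that empty or silent bands do not produce log(0).
    log_e = [math.log(max(e, 1e-10)) for e in energies]
    n = len(log_e)
    # First num_coeffs coefficients of the (unnormalized) DCT-II.
    return [sum(log_e[i] * math.cos(math.pi * k * (i + 0.5) / n)
                for i in range(n))
            for k in range(num_coeffs)]
```

Note that with short frames the narrowest low-frequency filters may not cover any spectrum bin, which is why the sketch floors the band energies before taking the logarithm.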


6.2.2 Trade-offs in MFCC Extraction 


The high level goal of Wishbone is to explore how a 
complex application written in a single high level lan- 
guage can be efficiently and easily distributed across 
a network of devices and support many different plat- 
forms. As such, the MFCC application presents an in- 
teresting challenge because for sensors with very limited 
resources there appears to be no perfect solution; rather, 
using Wishbone the application designer can explore dif- 
ferent trade-offs in application performance. 

These trade-offs arise because this algorithm squeezes 
a resource-limited device between two insoluble prob- 
lems: not only is the network capacity insufficient to for- 
ward all the raw data back to a central point, but the CPU 
resources are also insufficient to extract the MFCCs in 
real time. If the application has any partitioning that 
fits the resource constraints, then the goal of Wishbone 
is to select the best partition, for example, lowest cost in 
terms of energy. If the application does not fit at its ideal 
data rate, ultimately, some data will be dropped on some 
target platforms. The objective in this case is to find a 
partitioning that minimizes this loss and therefore maxi- 
mizes the throughput: the amount of input data success- 
fully processed rather than dropped at the input sources 
or in the network. 


6.2.3 Implementing Audio Capture


Some platforms, such as the iPhone and embedded- 
Linux platforms (such as the Gumstix), provide a com- 
plete and reliable hardware and software audio capture 
mechanism. On other platforms, including both TMotes 
and J2ME phones, capturing audio is more challenging. 

On TMotes, we used a custom-built audio board to
acquire audio. The board uses an electret microphone,
four opamp stages, a programmable-gain amplifier, and
a 2.5 V voltage reference. We found that when
the microphone was powered directly by the analog supply
of the TMote, the audio board performed well when
the mote was only acquiring audio, but was very noisy
when the mote was communicating. The communication
causes a slight modulation of the supply voltage,
which gets amplified into significant noise. Using
a separately regulated supply for the microphone
removed this noise. The anti-aliasing filter is a simple
RC filter; to better reject aliasing, the TMote samples
at a high rate and applies a digital low-pass filter
(filtering and decimating a 32 Ks/s stream down to 8 Ks/s
works well). The amplified and filtered audio signal
is presented to an ADC pin of the TMote’s microcontroller,
which has 12 bits of resolution. We use the TinyOS
2.0 ReadStream<uint16_t> interface to the ADC,
which uses double buffering to deliver arrays of samples
to the application.
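
The oversample-then-decimate step can be illustrated with a short Python sketch. The filter used on the mote is not specified in the text, so the boxcar (moving-average) low-pass filter below is purely a placeholder, and the function name decimate is hypothetical. The sketch computes the filter output only at the samples that are kept, which is the usual way to avoid wasted work when decimating.

```python
def decimate(samples, factor=4, taps=None):
    """Low-pass filter and keep one output per `factor` input samples.

    With factor=4, a 32 Ks/s stream becomes an 8 Ks/s stream.
    """
    if taps is None:
        taps = [1.0 / factor] * factor   # placeholder boxcar low-pass filter
    out = []
    # Evaluate the FIR filter only at the positions that survive decimation.
    for start in range(0, len(samples) - len(taps) + 1, factor):
        out.append(sum(t * samples[start + k] for k, t in enumerate(taps)))
    return out
```

For example, one second of input (32000 samples) yields 8000 output samples.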


Phones naturally have built-in microphones and microphone
amplifiers, but we have nonetheless encountered
a number of problems using them as audio sensors.
Many J2ME phones support the Mobile Media
API (JSR-135), which may allow a program to record
audio and video and to take photographs. However, support
for JSR-135 does not automatically imply support for audio or
video recording or for taking snapshots. Even when audio
recording is supported, the API permits only batch
recording to an array or file (rather than a continuous
stream), resulting in gaps.

We ran into a bug on the Nokia N80: after recording 
audio segments for about 20 minutes, the JVM would 
crash. Other Nokia phones with the same operating sys- 
tem (Symbian S60 3rd Edition) exhibited the same bug. 
We worked around this bug using a simple Python script 
that runs on the phone and accepts requests to record au- 
dio or take a photograph through a TCP connection, re- 
turning the captured data also via TCP. The J2ME pro- 
gram acquires audio by sending a request to this Python 
script, which can record indefinitely without crashing. 


The J2ME partition of the Wishbone program uses
TCP to stream partially processed results to the server.
When the J2ME program connects, the phone asks the user to
choose an IP access point; we normally use a WiFi connection,
but the user can also choose a cellular IP connection.
With any of these communication methods, dependence
on user interaction presents a practical barrier
to using phones in an autonomous sensor network. Yet
these software limitations are incidental rather than
fundamental, and should not pose a long-term problem.


7 Evaluation 


In this section we evaluate the Wishbone system on the 
EEG and speech detection applications we discussed in 
Section 6. We focus on two key questions: 


1. Can Wishbone efficiently select the best partitioning 
for a real application, across a range of hardware 
devices and data rates? 

2. In an overload situation, can Wishbone effectively
predict the effects of load-shedding and recommend
a “good” partitioning?


Figure 5: Relationship between partitioning and compute-bound sustainable data rates. On the left (a), a subset of the EEG
application (one channel). The X axis shows a required data rate, the Y axis the number of operators in the computed optimal node
partition. On the right (b), the speaker detection application; we flip the axes due to the small number of viable cut-points. For each
viable cut-point, we show the maximum data-rate supported on each hardware platform.


7.1 EEG Application 


Our EEG application provides an opportunity to explore
the scaling capability of our partitioning method. In
particular, we look at our worst-case scenario: partitioning
all 22 channels (1412 operators). As the CPU budget
increases, the optimal strategy for bandwidth reduction is
to move more channels to the nodes. On our lower-power
platforms, not all the channels can be processed
on one node. The graph in Figure 5(a) shows partitioning
results only for the first of 22 channels, where we
vary the input data rate on the X axis and measure the
number of operators that “fit” on different platforms. We
ran lp_solve to derive a partitioning 2100 times, linearly
varying the data rate to cover everything from “everything
fits easily” to “nothing fits”. To remove confounding
factors, the objective function was configured to minimize
network bandwidth subject to not exceeding CPU
capacity (α = 0, β = 1): that is, allow the CPU to be
fully utilized (but not over-utilized). As we increase the
data rate (moving right), fewer operators can fit within
the CPU bounds on the node (moving down). The sloping
lines show that every stage of processing yields data
reductions.


The distribution of resulting execution times is depicted
as two CDFs in Figure 6, where the x axis shows
execution time in seconds, on a log scale. The top curve
in Figure 6 shows that even for this large graph, lp_solve
always found the optimal solution in under 90 seconds.
The typical case was much better: 95 percent of the
executions reached optimality in under 10 seconds. While
this shows that an optimal solution is typically discovered
in a reasonable length of time, that solution is not
necessarily known to be optimal. If the solver is used
to prove optimality, both worst and typical case runtimes
become much longer, as shown by the lower CDF curve
(yet still under 12 minutes). To address this, we can use
an approximate lower bound to establish a termination
condition based on estimating how close we are to the
optimal solution.


Figure 6: CDF of the time required for lp_solve to reach an
optimal partitioning for the full EEG application (1412
operators), invoked 2100 times with varying data rates. The higher curve
shows the execution time at which an optimal solution was
found, while the lower curve shows the execution time required
to prove that the solution is optimal. Execution times are from
a 3.2 GHz Intel Xeon.


7.2 Speech Detection Application 


The speech detection application is a linear pipeline of
only a dozen operators. Thus the optimization process
for picking a cut point should be trivial: a brute-force
testing of all cut points will suffice. Nevertheless, this
application’s simplicity makes it easy to visualize and
study, and the fact that the data rate it needs to process all
data is unsustainable for TinyOS devices provides an
opportunity to examine the other side of Wishbone’s usage:
what to do when the application doesn’t fit.
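
For a linear pipeline, the brute-force search mentioned above can be sketched in a few lines of Python. This is an illustration under simplifying assumptions, not Wishbone's ILP formulation: the function name best_cut, its parameters, and the cost model (CPU load and bandwidth both scaling linearly with the frame rate, with a weighted sum alpha * cpu + beta * bandwidth as the objective) are hypothetical stand-ins for the model used in the paper. A cut value of c means that the first c operators run on the node.

```python
def best_cut(cpu_per_frame, bytes_out, input_bytes, frames_per_sec,
             cpu_budget=1.0, net_budget=float("inf"), alpha=0.0, beta=1.0):
    """Try every cut point of a linear pipeline and return the cheapest.

    cpu_per_frame[i]: CPU seconds operator i needs per frame on the node.
    bytes_out[i]:     bytes per frame emitted by operator i.
    Returns the number of operators placed on the node, or None if no
    cut satisfies both the CPU and the network budget.
    """
    best = None
    for cut in range(len(cpu_per_frame) + 1):
        cpu = frames_per_sec * sum(cpu_per_frame[:cut])   # fraction of CPU
        sent = input_bytes if cut == 0 else bytes_out[cut - 1]
        bandwidth = frames_per_sec * sent                 # bytes per second
        if cpu > cpu_budget or bandwidth > net_budget:
            continue                                      # does not fit
        cost = alpha * cpu + beta * bandwidth
        if best is None or cost < best[0]:
            best = (cost, cut)
    return None if best is None else best[1]
```

A return value of None corresponds to the situation examined in this section, where the application does not fit at the requested data rate and the rate itself must be reduced.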


In applying Wishbone to the development process for
our speech detection application, we were able to quickly
assess the performance on several different platforms.
Figure 7 is a detailed visualization of the performance
trade-offs, showing only the profiling results for TMote
Sky (a TinyOS platform). In this figure, the X axis
represents the linear pipeline of operators, and the Y axis
represents profiling results. Each vertical impulse represents
the number of microseconds of CPU time consumed by
that operator per frame (left scale), while the line
represents the number of bytes per second output by that
operator. It is easy to visualize the trade-off between CPU
cost and data rate. Each point on the X axis represents a
potential graph cut, where the sum of the red bars to the
left provides the processing time per frame.


Thus, we see that the MFCC dataflow has multiple 
data-reducing steps. The algorithm must natively process 
40 frames per second in real time, or one frame every 
25 ms. The initial frame is 400 bytes; after applying the 
filter bank the frame data is reduced to 128 bytes, using 
250 ms of processing time; after applying the DCT, the 
frame data is further reduced to 52 bytes, but using a total 
of 2 s of processing time. This structure means that al- 
though no split point can fit the application on the TMote 
at the full rate, we can achieve different CPU/bandwidth 
trade-offs by selecting different split points. Selecting 
a bad partitioning can result in retrieving no data, and 
the best “working” partition provides 20 times more data 
than the worst. Figure 5(b) shows an axes-flipped ver- 
sion of Figure 5(a): predicted data-rate as a function of 
the partition point. Only viable (data reducing) cutpoints 
are shown. Bars falling under the horizontal line indicate 
that the platform cannot be expected to keep up with the 
full (8 kHz) data rate. 
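
The arithmetic behind these trade-offs can be made explicit with a small Python sketch. It considers only the compute-bound side (network limits are ignored), and the helper name compute_bound is hypothetical. The inputs are the figures quoted above: 40 frames per second at the full rate, 250 ms and 128 bytes per frame when cutting after the filter bank, and 2 s and 52 bytes per frame when cutting after the DCT.

```python
FULL_RATE = 40.0   # frames per second at the full 8 kHz data rate


def compute_bound(cpu_seconds_per_frame, bytes_per_frame, full_rate=FULL_RATE):
    """Return (frames/sec sustained, fraction of input kept, bytes/sec sent)
    for a node that spends cpu_seconds_per_frame on every processed frame."""
    sustained = min(full_rate, 1.0 / cpu_seconds_per_frame)
    return sustained, sustained / full_rate, sustained * bytes_per_frame


# Cut after the filter bank: 4 frames/sec, 10% of the input, 512 bytes/sec.
after_filterbank = compute_bound(0.25, 128)

# Cut after the DCT: 0.5 frames/sec, 1.25% of the input, 26 bytes/sec.
after_dct = compute_bound(2.0, 52)
```

By comparison, sending raw frames costs no node CPU time but requires 40 x 400 = 16000 bytes per second of network bandwidth.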

As expected, the TMote is the worst performing platform,
with the Nokia N80 performing only about twice
as fast, a surprisingly poor result given that the
N80 has a 32-bit processor running at 55X the clock rate
of the TMote. This is due to the poor performance of
the JVM implementation. The 412 MHz iPhone platform
using GCC performed 3X worse than the 400 MHz
Gumstix-based Linux platform; we believe that this is
due to the frequency scaling of the processor kicking in
to conserve power.

We can also visualize the relative performance of dif- 
ferent operators across different platforms. For each plat- 
form processing the complete operator graph, Figure 8 
shows the fraction of time consumed by each operator. If 
the time required for each operator scaled linearly with 
the overall speed of the platform, all three lines would be 
identical. However, the plot clearly shows that the dif- 
ferent capabilities of the platforms result in very differ- 
ent relative operator costs. For example, on the TMote, 
floating point operations, which are used heavily in the 
cepstrals operator, are particularly slow. This 
shows that a model that assumes the relative costs of op- 
erators are the same on all platforms would mis-estimate 
costs by over an order of magnitude. 


7.3 Wishbone Deployment


To validate the quality of the partitions selected by Wish- 
bone, we deployed the speech detection application on 
a testbed of 20 TMote Sky nodes. We also used this 
deployment to validate the specific performance predic- 
tions that Wishbone makes using profiling data (e.g., if 
a combination of operators was predicted to use 15%
CPU, did it?).


7.3.1 Network Profiling 


The first step in deploying Wishbone is to profile the
network topology in the deployment environment. It is
important to note that simply changing the network size
changes the available per-node bandwidth and thus
requires re-profiling of the network and re-partitioning of
the application. We run a portable WaveScript program
that measures the goodput from each node in the network.
This tool sends packets from all nodes at an identical
rate, which gradually increases. For our 20 node
testbed the resulting network profile is typical for TMote
Sky devices: each node has a baseline packet drop rate
that stays steady over a range of sending rates, and then
at some point reception drops off dramatically as the
network becomes excessively congested. Our profiling tool
takes as input a target reception rate (e.g. 90%), and returns
a maximum send rate (in msgs/sec and bytes/sec) that the
network can maintain. For the range of sending rates within this
upper bound the assumption mentioned in Section 4.3 holds:
attempting to send more data does not result in fewer actual
bytes of data received. Thus we are free to maximize
the data rate within the upper bound provided by the network
profiling tool, and thereby maximize total application
throughput. This enables us to use binary search to
find the maximum sustainable data rate when we are
in an overload situation.


Figure 9: Loss rate measurements for a single TMote plus
basestation across different partitionings. Lines show the
percentage of input data processed, the percentage of network
messages received, and the product of these: the goodput.
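
The binary search mentioned above can be sketched as follows. This is a minimal Python illustration that assumes feasibility is monotonic in the data rate (if a rate fits, every lower rate fits too). The function max_sustainable_rate, the per-event costs in CUTS, and the network budget are all hypothetical values chosen for the example, not measurements from the testbed, and the feasibility test stands in for a run of the partitioner.

```python
def max_sustainable_rate(fits, low, high, tolerance=0.01):
    """Largest rate in [low, high] (within `tolerance`) for which fits(rate)
    is True, assuming that feasibility is monotonic in the rate."""
    if not fits(low):
        return None            # even the lowest rate is infeasible
    if fits(high):
        return high            # no overload at all
    while high - low > tolerance:
        mid = (low + high) / 2.0
        if fits(mid):
            low = mid          # feasible: search higher rates
        else:
            high = mid         # infeasible: search lower rates
    return low


# Hypothetical cut points: (CPU seconds per event, bytes sent per event).
CUTS = [(0.0, 400), (0.25, 128), (2.0, 52)]
NET_BUDGET = 400.0             # bytes/sec allowed by the network profile


def some_cut_fits(rate):
    # Stand-in for the partitioner: does any cut respect both limits?
    return any(rate * cpu <= 1.0 and rate * nbytes <= NET_BUDGET
               for cpu, nbytes in CUTS)
```

With these made-up numbers, max_sustainable_rate(some_cut_fits, 0.0, 40.0) converges to just under 3.125 events per second, the point at which the middle cut saturates the network budget.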

To empirically verify that our computed partitions are 
optimal, we established a ground truth by exhaustively 
running the speech detection application at every cut 
point on our testbed. Figures 9 and 10 show the results 
for six relevant cutpoints, both for a single node network 
(testing an individual radio channel) and for the full 20 
node TMote network. Wishbone counts missed input 
events and dropped network messages on a per-node ba- 
sis. The relevant performance metric is the percentage 
of sample data that was fully processed to produce out- 
put. This is roughly the product of the fraction of data 
processed at sensor inputs, and the fraction of network 
messages that were successfully received. 

Figure 9 shows the input event loss and network loss
for the single TMote case, as well as the resulting goodput.
On a single mote, the data rate is so high at early
cutpoints that it drives the network reception rate to zero.
At later cutpoints too much computation is done at the
node and the CPU is busy for long periods, missing input
events. In the middle, even an underpowered TMote
can process 10% of sample windows. This is equivalent
to polling for human speech four times a second, a
reasonably useful configuration.


Figure 10: Goodput rates for a single TMote and for a network
of 20 TMotes, over different partitionings when running on our
TMote testbed.

Figure 10 compares the goodput achieved with a sin- 
gle TMote and basestation to the case of a network of 20 
TMotes. For the case of a single TMote, peak through- 
put rate occurs at the 4th cut point (filterbank), while for 
the whole TMote network in aggregate, peak throughput 
occurs at the 6th and final cut point (cepstral). As expected,
the throughput line for the single mote tracks the
whole-network line closely until cut point six. For a high-data-rate
application with no in-network aggregation, a many
node network is limited by the same bottleneck as a net- 
work of only one node: the single link at the root of the 
routing tree. At the final cut point, the problem becomes 
compute bound and the aggregate power of the 20 TMote 
network makes it more potent than the single node. 

We also ran the same test on a Meraki Mini based
on a low-end MIPS processor. While the Meraki has rel- 
atively little CPU power—only around 15 times that of 
the TMote—it has a WiFi radio interface with at least 
10x higher bandwidth. Thus for the Meraki the optimal 
partitioning falls at cut point 1: send the raw data directly 
back to the server. 

Having determined the optimal partitioning in our 
real deployment, we can now compare it to the recom- 
mendation of our partitioning algorithm. Doing this is 
slightly complex as the algorithm does not model mes- 
sage loss; instead, it keeps bandwidth usage under the 
user-supplied upper bound (using binary search to find 
the highest rate at which partitioning is possible), and 
minimizes the objective function. In the real network, 
lost packets may cause the actual delivered bandwidth 
to be somewhat less than expected by the profiler. For- 
tunately, the cut-point that maximizes throughput should 
be the same irrespective of loss as CPU and network load 
scale linearly with data rate.

In this case, binary search found that the highest data
rate for which a partition was possible (respecting net- 
work and CPU limits) was at 3 input events per second 
(with each event corresponding to a window of 200 au- 
dio samples). The optimal partitioning at that data rate* 
was in fact cut point 4, right after filterbank, as in the 
empirical data. Likewise, the computed partitions for 
the 20 node TMote network and single node Meraki test 
matched their empirical peaks, which gives us some con- 
fidence in the validity of the model. 

In the future, we would like to further refine the precision
of our CPU and network cost predictions. To use
our ILP formulation we necessarily assume that both
costs are additive (two operators using 10% CPU each will
together use 20%), and we do not account for operating
system overheads or processor involvement in network
communication. For example, on the Gumstix ARM-Linux
platform the entire speaker detection application was
predicted to use 11.5% CPU based on profiling data. When
measured, the application used 15% CPU. Ideally we
would like to take an automated approach to determining
these scaling factors.
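
One simple way such a scaling factor could be estimated is sketched below in Python. This is a hypothetical approach, not a technique described in the paper: it fits a single multiplicative factor by least squares over pairs of predicted and measured CPU usage, and the names predicted_cpu and fit_scaling_factor are made up for the example.

```python
def predicted_cpu(operator_costs):
    """Additive model: the predicted load is the sum of per-operator costs."""
    return sum(operator_costs)


def fit_scaling_factor(pairs):
    """Least-squares fit of k in measured ~= k * predicted.

    pairs: iterable of (predicted, measured) CPU percentages.
    """
    pairs = list(pairs)
    num = sum(p * m for p, m in pairs)
    den = sum(p * p for p, _ in pairs)
    return num / den


# The single data point quoted above: predicted 11.5%, measured 15%.
factor = fit_scaling_factor([(11.5, 15.0)])   # roughly 1.30
```

With only one observation the fit reduces to the ratio 15 / 11.5, so the profiled costs on this platform would be scaled up by about 30%.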


8 Related Work


First we overview other systems that, like Wishbone, 
automatically partition programs—either dynamically or 
statically—to run on multiple devices. Generally speak- 
ing, Wishbone differs from these existing systems by us- 
ing a profile-driven approach to automatically derive a 
partitioning, as well as its support for diverse platforms. 

The Pleiades/Kairos systems [13] statically partition 
a centralized C-like program into a collection of node- 
level nesC programs that run on motes. Pleiades is pri- 
marily concerned with the correct synchronization of 
shared state between nodes, including consistency, seri- 
alizability, and deadlocks. Wishbone, in contrast, is con- 
cerned with high-rate shared-nothing data processing ap- 
plications, where all nodes run the same code. Because 
Wishbone programs are composed of a series of dis- 
crete dataflow operators that repeatedly process stream- 
ing data, they are amenable to our profile-based approach 
for cost estimation. Finally, by constraining ourselves 
to a single cut point, we can generate optimal partition- 
ings quickly, whereas Pleiades uses a heuristic partition- 
ing approach to generate a number of cut points. 

Triage [3] is a related system for “microservers” that
act as gateways in sensor network applications. Triage’s
focus is on power conservation on such servers by using a
lower-power device to wake a higher-power device based
on a profile of expected power consumption and utility
of data coming in over the sensor network. However,
it does not attempt to automatically partition programs
across the two device classes as Wishbone does.


*In this case with α = 0, β = 1, although the linear combination
in the objective function is not particularly important: when we are
maximizing data rate we are saturating either CPU or bandwidth.


In stream processing there has been substantial work
looking at the problem of migrating operators at
runtime [2, 18]. Dynamic partitioning is valuable in
environments with variable network bandwidth or unpredictable
load, but also comes with serious downsides in terms
of runtime overheads. Also, by focusing on static
partitioning, Wishbone is able to provide feedback to users
at compile time about whether their program will “fit”
their sensor platform and hardware configuration.


There has been related work in the context of tradi- 
tional, non-sensor related distributed systems. For ex- 
ample, the Coign [11] system automatically partitions 
binary applications written using the Microsoft COM 
framework across several machines, with the goal of 
minimizing communication bandwidth. Like Wishbone, 
it uses a profile-driven approach. Unlike Wishbone, 
Coign does not formulate partitioning as an optimization
problem, and only targets Windows PCs. Neubauer
and Thiemann [15] present a similar framework for
partitioning client-server programs. Automatic partitioning is
also widely used in high-performance computing, where
it is usually applied to some underlying mesh, and in
automatic layout of circuits. Finally, several systems,
including JESSICA2 [25], MagnetOS [4], and cJVM [1],
implement distributed Java virtual machines that appear 
as a single system. These systems must use runtime 
methods to load-balance threads between machines. The 
overheads on communication and synchronization are 
typically high, and only applications with a high ratio 
of computation to communication will scale effectively. 


Tenet [9] proposes a two-tiered architecture with pro- 
grams decomposed across sensors and a centralized 
server, much as in Wishbone. The VanGo system [10], 
which is related to Tenet, proposes a framework for 
building high data rate signal processing applications in 
sensor networks, similar to the applications that inspired 
our work on Wishbone. But VanGo is constrained to a 
linear chain of filters, does not support automatic parti- 
tioning, and runs only TinyOS code. 


Marionette [24] and Spatial Views [17] use static par- 
titioning of programs between sensor nodes and a server 
that is explicitly under the control of the programmer. 
These systems work by allowing users to invoke pre- 
defined handlers (written in, for example, nesC) from a 
high-level centralized program that runs on a server, but 
neither offers automatic partitioning. 


Abstract Regions [22] and Hood [23] enable operations
over clusters of nodes (or “regions”) rather than single
sensors. They allow data from multiple nodes to be
combined and processed, but are targeted at coordinating 
sensors rather than stream processing. 


NSDI ’09: 6th USENIX Symposium on Networked Systems Design and Implementation 




9 Future Work/Conclusions 


The model presented in this paper enables communica- 
tion between embedded endpoints and a central server. 
But it would be straightforward to extend our model 
with a basic form of in-network aggregation: namely, 
tree-based aggregation that happens at every node in the 
network, useful, for example, for taking average sensor 
readings. This communication pattern would be exposed 
as a “reduce” operator that would reside in the logical 
node partition, but would implicitly take its input not just 
from streams within the local node, but from child nodes 
routing through it in an aggregation tree. The partitioning
algorithm remains the same. If the reduce operator
is assigned to the embedded node, aggregation happens
in-network; otherwise all data is sent to the server.
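
The tree-based aggregation described above can be illustrated with a short Python sketch for the averaging example. The representation of the routing tree as a dictionary of children, the node names, and the function names are assumptions made for illustration. Each node combines its own reading with the partial results of its children and forwards a single (sum, count) pair toward the root, rather than forwarding every raw reading.

```python
def aggregate(node, children, readings):
    """Return (sum, count) for the subtree rooted at `node`.

    This models a "reduce" operator placed on every embedded node: a node
    merges its children's partial results with its own reading and sends
    one pair upstream.
    """
    total, count = readings[node], 1
    for child in children.get(node, []):
        child_total, child_count = aggregate(child, children, readings)
        total += child_total
        count += child_count
    return total, count


def network_average(root, children, readings):
    total, count = aggregate(root, children, readings)
    return total / count


# Hypothetical four-node routing tree.
children = {"root": ["a", "b"], "a": ["c"]}
readings = {"root": 10.0, "a": 20.0, "b": 30.0, "c": 40.0}
```

Forwarding (sum, count) pairs rather than averages keeps the result exact regardless of how unevenly the nodes are spread across the subtrees.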

Also, while our prototype implementation only sup- 
ports networks of one type of node, the model can also 
handle certain kinds of mixed networks. A single log- 
ical node partition can take on different physical parti- 
tions at different nodes. This is accomplished simply by 
running the partitioning algorithm once for each type of 
node. The server would need to be engineered to deal 
with receiving results from the network at various stages 
of partial processing. In the future, mixed partitions may 
be desirable even for homogeneous networks. Varying 
wireless link quality can create a situation where each 
node should be partitioned differently.

A more radical change would extend the model with 
multiple logical partitions corresponding to categories of 
devices. This opens up several design choices; for example,
what communication relationship should the logical
partitions have? We have verified that we can use
an ILP approach for a restricted three-tier network
architecture. (Motes communicate only to microservers,
and microservers to the central server.) But going further 
would require revisiting the partitioning algorithm. 


Acknowledgements: This work was supported by the

Abstract


Voice over IP (VoIP) in 802.11 wireless networks 
(WiFi) is an attractive alternative to cellular wireless tele- 
phony. Unfortunately, VoIP traffic is well known to make 
inefficient use of such networks. Indeed, we demon- 
strate that increasing handset deployment has the poten- 
tial to cripple existing hotspot and enterprise WiFi net- 
works. Our experiments show that VoIP halves the avail- 
able TCP capacity of an 802.11b hotspot when six to 
eight VoIP stations share the medium, and effectively 
extinguishes TCP connectivity when ten VoIP stations 
are present. Further, we show that neither the higher
data rates of 802.11a/g nor the 802.11 standard for quality
of service, 802.11e, fully ameliorates the problem.
Instead, the problem is rooted in WiFi’s contention- 
based medium-access control mechanism and consider- 
able framing overhead. 

To remedy this problem, we propose Softspeak, a pair 
of backwards-compatible software extensions that en- 
able VoIP traffic to share the channel in a more effi- 
cient, TDMA-like manner. Softspeak does not require 
any modifications to the WiFi protocols and significantly 
reduces the impact of VoIP on TCP capacity while si- 
multaneously improving key VoIP call-quality metrics. 
Results show improvements in TCP download capacity 
of 380% for 802.11b and 25-200% for 802.11g. 


1 Introduction 


Voice-over-IP (VoIP) technology is now pervasive in 
wire-line networks, embodied by wildly successful ap- 
plications like Skype. Wireless deployment, in contrast, 
has so far been limited to certain niche products. Re- 
cently, however, WiFi-capable consumer phone handsets 
such as T-Mobile’s UMA and the Apple iPhone have 
been released to the US market in large numbers, por- 
tending a huge influx of WiFi VoIP users once third-party 
applications like iCall [1] become widely available for 
these platforms. In the near future, it may not be unusual 
for a dozen active WiFi VoIP handsets to be in range of a 
single WiFi hot-spot, for example at a local Starbucks. 
One might imagine that such a scenario would be eas- 
ily supported by existing installations, as VoIP is a rela- 
tively low-bandwidth protocol. For example, given an 


802.11b channel with 11 Mbps of capacity, a G.729 
VoIP codec rate of 6.4 Kbps, and a combined header 
size of RTP, UDP and IP of 40 bytes, one might ex- 
pect a single AP to support over 70 bidirectional VoIP 
calls and still leave half of the channel capacity for data 
traffic. It is well known, however, that nothing could be 
further from the truth; previous researchers have shown 
that an 802.11b network supports as few as six simulta- 
neous VoIP sessions [4, 9, 20], depending upon the par- 
ticular characteristics of the network and codecs in use. 
This counterintuitive result is due to the large per-packet 
overhead imposed by WiFi for each VoIP packet—both 
in terms of protocol headers and due to WiFi contention. 
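
The naive estimate of "over 70" calls can be reproduced with a short calculation. The sketch below is only a back-of-the-envelope check that ignores all 802.11 MAC and PHY overhead; the 10-ms packetization interval (100 packets per second) is an assumption taken from the codec configuration used later in the paper, and the function name is illustrative.

```python
# Back-of-the-envelope estimate only: ignores every 802.11 MAC/PHY cost.
CHANNEL_BPS = 11_000_000   # nominal 802.11b rate quoted in the text
CODEC_BPS = 6_400          # G.729 codec rate quoted in the text
HEADER_BYTES = 40          # combined RTP + UDP + IP headers
PACKETS_PER_S = 100        # assumption: one packet every 10 ms


def naive_call_capacity(channel_share=0.5):
    """Number of bidirectional calls that fit in a share of the channel."""
    payload_bytes = CODEC_BPS / 8 / PACKETS_PER_S          # 8 bytes of voice
    one_way_bps = (payload_bytes + HEADER_BYTES) * 8 * PACKETS_PER_S
    call_bps = 2 * one_way_bps                             # uplink + downlink
    return channel_share * CHANNEL_BPS / call_bps


if __name__ == "__main__":
    print(f"naive capacity: {naive_call_capacity():.1f} calls")
```

Under these assumptions each call consumes 76.8 Kbps, so half of the channel would appear to hold a little over 71 calls, which is the expectation the measurements contradict.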


Call quality has traditionally been a major concern for 
WiFi VoIP deployments, since real-time audio traffic has 
stringent requirements in terms of loss rate, delay and 
jitter, and needs to be sent at a high rate (e.g., 50-100 
packets per second for many VoIP codecs) to maintain 
acceptable audio quality. In mixed-use cases, best-effort 
traffic can cause excessive queuing of VoIP traffic at ac- 
cess points and may increase packet loss rate due to con- 
tention for the medium. Since a VoIP call occupies only a 
very small amount of bandwidth (possibly as few as eight 
bytes of voice data per packet), many researchers [4, 25] 
and commercial providers [2] have proposed prioritizing 
VoIP packets, with the unstated assumption that the im- 
pact on overall network performance will be minimal. 
However, as we demonstrate experimentally, as few as 
six VoIP calls may remove over half of the TCP capac- 
ity in 802.11b. Moreover, prioritizing VoIP sessions runs 
the very real danger of drowning out all competing best- 
effort traffic, such as Web browsing and email messag- 
ing. Somewhat surprisingly, our experiments show that 
neither the increased speed of 802.11a/g nor the quality- 
of-service mechanisms of 802.11e change this reality. 


In this paper, we address the impending potential dis- 
aster: that widespread VoIP usage will cripple hotspot 
and enterprise WiFi networks. In addition to quantify- 
ing and explaining the impact of VoIP on the capacity 
of WiFi, we propose backward-compatible modifications 
to 802.11 that aggregate multiple VoIP clients into the 
equivalent of a single VoIP client, thus reducing VoIP’s 
impact on the network’s data-carrying capacity. 

Previous work in this domain has proposed the con- 
cept of ‘downlink aggregation’ in simulation [23, 24], 
which encapsulates multiple VoIP packets into a single 
packet at the AP, addressed to all VoIP stations associated 
with the same AP. Our experiments demonstrate, how- 
ever, that downlink aggregation is insufficient to fully 
address the problem. We present a complementary tech- 
nique for the uplink direction that serializes channel ac- 
cess by establishing a TDMA-like schedule. We show 
that this can be done in a distributed manner by inde- 
pendent VoIP stations. We combine uplink TDMA and 
downlink aggregation mechanisms to develop a system 
called Softspeak that simultaneously improves VoIP call 
quality while preserving network capacity for best-effort 
data transfer. 

We implement and evaluate Softspeak on a testbed of 
Linux-based 802.11b/g/e devices within an operational 
enterprise WiFi network. We show that Softspeak im- 
proves residual downlink TCP capacity of the network 
substantially, e.g., by 380% in the presence of ten VoIP 
calls in 802.11b and by 200% in 802.11g (protected 
mode). We also achieve significant improvements in 
UDP and TCP uplink capacity, as well as in 802.11g un- 
protected mode. Furthermore, we show that Softspeak 
can improve VoIP call quality, providing an important in- 
centive for client deployment. To the best of our knowl- 
edge, our work is the first to present a system based on 
commodity hardware that performs both uplink TDMA 
and downlink aggregation to improve the performance 
of multiple, simultaneous VoIP sessions while increasing 
the residual data-carrying capacity of the WiFi network. 


2 The impact of VoIP on WiFi 


In this section we empirically demonstrate the degrada- 
tion of WiFi network capacity as well as VoIP call quality 
in the presence of an increasing number of VoIP clients. 
We then employ a detailed simulation of the 802.11 DCF 
algorithm to determine the precise source of the problem. 


2.1 Sources of overhead 


The 802.11 protocol is designed to allow clients to access 
the channel in a distributed manner. Uncoordinated ap- 
proaches are known to be inefficient under heavy load as 
collisions become more frequent and the total airtime uti- 
lization of the wireless channel reduces dramatically due 
to airtime wasted on garbled frames. This problem is par- 
ticularly relevant in the case of VoIP traffic, since VoIP 
clients contend often due to the real-time nature of the 
traffic. The resulting increased collision rate increases 
loss and jitter, which in turn degrade TCP performance 
and harm VoIP call quality. 

Furthermore, given the small data payload of VoIP 
packets the overhead of transmitting the various head- 
ers in a VoIP packet becomes considerable: each VoIP 
packet in a WiFi network is typically encumbered with 
RTP, UDP, IP, MAC and PHY headers as well as a syn- 
chronous 802.11 ACK frame. For example, a G.729 
packet may take 157 μs to transmit at the maximum rate 
in 802.11b, or 273 μs if we include the ACK frame (and 
assume it is sent at maximum rate). Of this time, the 
eight bytes of voice data carried inside the packet take 
up only six microseconds; the entire IP packet requires 
only 35 μs of airtime, resulting in 680% overhead. Al- 
though 802.11g can reduce this overhead to 240% in the 
best case, the overhead remains substantial at over 400% 
(again optimistically assuming maximum rates are used) 
in protected mode, which is required when any legacy 
802.11b device is present. 
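
The 802.11b figures above can be approximated with a small airtime calculation. The sketch below is not taken from the paper: the short-preamble PLCP duration (96 μs), the 28-byte MAC header and FCS, the 8-byte LLC/SNAP header, and the inclusion of one SIFS before the ACK are assumptions chosen because together they reproduce the quoted 157 μs and 273 μs values.

```python
RATE_MBPS = 11.0       # maximum 802.11b rate
PLCP_US = 96.0         # assumption: short preamble plus PLCP header
SIFS_US = 10.0         # 802.11b short inter-frame spacing
MAC_BYTES = 28         # assumption: 24-byte MAC header + 4-byte FCS
LLC_BYTES = 8          # assumption: LLC/SNAP encapsulation
IP_PACKET_BYTES = 48   # 40 bytes of RTP/UDP/IP headers + 8 bytes of voice
VOICE_BYTES = 8
ACK_BYTES = 14


def tx_us(nbytes):
    """Time to send nbytes at the data rate, excluding the preamble."""
    return nbytes * 8 / RATE_MBPS


def frame_us():
    """Airtime of the VoIP data frame alone."""
    return PLCP_US + tx_us(MAC_BYTES + LLC_BYTES + IP_PACKET_BYTES)


def exchange_us():
    """Airtime of the data frame, SIFS, and the 802.11 ACK."""
    return frame_us() + SIFS_US + PLCP_US + tx_us(ACK_BYTES)


def overhead_pct():
    """Airtime beyond the IP packet, relative to the IP packet's airtime."""
    ip_us = tx_us(IP_PACKET_BYTES)
    return 100 * (exchange_us() - ip_us) / ip_us


if __name__ == "__main__":
    print(f"frame {frame_us():.0f} us, with ACK {exchange_us():.0f} us, "
          f"overhead {overhead_pct():.0f}%")
```

With these parameters the unrounded overhead comes out slightly above 680%; the quoted value follows from the rounded 273 μs and 35 μs figures.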

Additionally, airtime usage may increase in response 
to loss rate, as rate control algorithms frequently lower 
the transmission rate in response to loss, regardless of 
whether the loss was due to poor signal quality or frame 
collision. Finally, we note that the resulting increase in 
airtime scarcity in turn tends to increase collision proba- 
bility and loss rate as more stations attempt to seize the 
channel at once, thereby completing a vicious circle. 


2.2 Experimental observation 


To quantify the impact of VoIP traffic on background data 
transmissions, we have configured a testbed to reflect a 
realistic scenario for VoIP usage in the enterprise: sta- 
tions sending and receiving VoIP traffic are spread out 
over several offices and are connected to an operational 
building-wide wireless network. For controlled exper- 
imentation we ensure that all stations associate to the 
same AP and do not roam between different APs. We 
use wireless cards from two different manufacturers to 
ensure our results are not artifacts of a particular piece of 
hardware and consider 802.11b, g and e. (Full details of 
the testbed are included in Section 4.1.) Unless specified 
otherwise, all experiments employ a 10-ms G.729 codec. 


2.2.1 Residual capacity 


We are interested in the residual WiFi capacity as well 
as VoIP call quality in the presence of a varying number 
of VoIP stations. Here, we measure the residual capacity 
by simultaneously running a bulk flow and measuring its 
throughput. We conduct separate experiments for uplink 
and downlink bulk flows, using both TCP and UDP. Our 
experiments with UDP measure the raw channel capacity 
available, while TCP measures the effective capacity for 
flows that are sensitive to loss and delay. For simplicity, 
we restrict our discussion to experiments using a single 
non-VoIP flow at a separate client; we present results for 
multiple data clients in Section 4.5. 

Figure 1: TCP throughput as a function of the number 
of VoIP streams in 802.11b (Avaya AP-8 access point). 

Figure 1 plots the throughput of TCP in the presence 
of a varying number of VoIP stations in an 802.11b net- 
work. As we increase the number of VoIP streams, the 
throughput of a TCP uplink flow (where “uplink” refers 
to the direction of the TCP data packets) degrades, halv- 
ing at around eight VoIP streams. In typical TCP us- 
age (e.g., Web traffic) more throughput is required from 
the downlink direction than from the uplink direction. 
Unfortunately, throughput degradation is far worse for a 
TCP downlink flow, which can be explained as follows. 
TCP’s congestion control mechanism attempts to use the 
maximum bandwidth available given the loss rate and the 
RTT. For both cases, the TCP sender needs to share the 
AP with other traffic for its downlink traffic (data pack- 
ets for TCP downlink or ACK packets for TCP uplink), 
and it is therefore at the AP that most losses are expected 
to occur. Losing a data packet is far worse than losing 
an ACK packet, however. Therefore, TCP is able to tol- 
erate a higher loss rate at the AP and achieve a higher 
throughput when sending data uplink. As a result, TCP 
downlink throughput halves at six VoIP streams and de- 
grades by over 85% in the presence of ten VoIP streams. 

UDP throughput degradation is less severe than that 
of TCP because UDP is less sensitive to loss and delay. 
Nevertheless we observe a significant throughput degra- 
dation (over 55% with ten VoIP sessions). We further 
note that the behavior of uplink UDP and TCP traffic 
and their impact on VoIP traffic appear quite similar, 
indicating that in our testbed the TCP uplink behavior is 
characterized mostly by channel capacity, rather than by 
loss and delay. 


2.2.2 Call quality 


As we increase the number of simultaneous VoIP ses- 
sions, the individual call quality also decreases. Call 
quality is a function of packet loss rate, delay and de- 
lay jitter, and is typically represented as a Mean Opin- 
ion Score (MOS) ranging from 1 (bad) to 5 (good). We 
use an approximation of MOS based on network-level 
metrics [6] with codec-specific parameters calibrated us- 
ing simulation [7]. We assume a playout buffer that is 


able to adapt its de-jitter delay such that on average no 
more than 1% of packets are late. We find that in the 
presence of TCP and bulk UDP uplink traffic, MOS de- 
creases from 3.8 to 1 as the number of VoIP stations in- 
creases from one to ten. In these cases VoIP traffic under- 
goes severe loss (reaching 50%) due to drop-tail queuing 
at the AP queue where it competes with bulk data or TCP 
acknowledgments. Conversely, TCP downlink traffic is 
suppressed by VoIP traffic to such an extent that the VoIP 
MOS remains relatively unaffected. A major challenge is 
thus to improve TCP downlink performance without sac- 
rificing VoIP call quality. 


2.2.3 802.11 protocol extensions 


To evaluate whether higher bit rates alleviate problems of 
contention and overhead we perform the same set of ex- 
periments using 802.11g. We find that throughput degra- 
dation is less severe in pure 802.11g networks than in 
802.11b. For example, TCP downlink performance does 
not drop as sharply as it does in 802.11b, but degrades 
in a similar way to TCP uplink and UDP performance. 
The loss in capacity when ten VoIP clients are present is 
still substantial, however, ranging from a 32% reduction 
in the case of UDP downlink to 39% for TCP downlink 
traffic. Similarly, while VoIP MOS is higher in 802.11g, 
it is still unacceptably low, dropping from 3.8 to 1.3 as 
the number of VoIP sessions increases from one to ten 
due to frequent losses. 

In practice, however, our enterprise WiFi deploy- 
ment almost never supports only 802.11g clients. For 
backwards compatibility, 802.11g requires a “protected 
mode” be used when 802.11b stations are detected. In 
protected mode an 802.11g station precedes each trans- 
mission by a clear-to-send (CTS) frame, thus increas- 
ing per-frame overhead. We observe that the capac- 
ity degradation caused by 802.11g VoIP clients in an 
802.11g protected-mode network is comparable to that 
of native 802.11b. Thus, the presence of a single legacy 
802.11b client (VoIP or otherwise) alongside ten VoIP 
clients removes 87% of TCP downlink capacity. In addi- 
tion, we find that whereas VoIP uplink loss is negligible 
in 802.11b in the presence of TCP downlink traffic, it 
varies from 10-40% in 802.11g protected mode, result- 
ing in an average VoIP MOS value of 2.0. 

Figure 2: TCP uplink throughput as a function of the 
number of VoIP stations in both 802.11b and 802.11b+e 
(Linksys WAP4400N access point). 

Figure 3: Uplink MOS of a 20-ms codec in 802.11g 
(protected mode) in the presence of TCP traffic. (Data 
points are slightly offset to avoid overlapping error bars.) 

The 802.11e protocol is specifically designed to allow 
real-time and data traffic to co-exist efficiently by prior- 
itizing real-time traffic. We compare the performance of 
802.11b and 802.11b+e using a popular 802.11e-capable 
access point (a Linksys WAP4400N, different from 
the Avaya AP-8 used in the previous experiments, which 
does not support 802.11e), with VoIP traffic configured 
to be classified and prioritized over other traffic at both 
the AP and the clients. In the presence of TCP uplink 
traffic, we observe that compared to 802.11b, 802.11e 
does indeed improve the MOS of VoIP traffic. However, 
as shown in Figure 2, this improvement is achieved at the 
expense of TCP uplink throughput, which degrades far 
more severely than is the case for 802.11b. TCP down- 
link performance is essentially similar to that of 802.11b, 
with a slight improvement in MOS. We conclude that 
while 802.11e (at least as implemented by a popular AP 
vendor) is able to improve call quality in some cases, it 
does not mitigate throughput degradation in the presence 
of a large number of VoIP clients. 


2.2.4 Less aggressive codecs 

By combining multiple 10-ms voice frames into a sin- 
gle IP packet, G.729 can be run at longer inter-packet 
intervals, thereby making more efficient use of network 
resources. Figure 3 considers a 20-ms G.729 codec in 
combination with TCP in 802.11g protected mode. As 
expected, the impact is less than for a 10-ms codec yet re- 
mains severe; the MOS for uplink VoIP traffic drops from 
4 to 3 on average (compared to 2 in the 10-ms case) and, 
more importantly, becomes highly erratic. Uplink and 
downlink TCP throughput are reduced by around 40% (not 
shown, cf. 87% in the 10-ms case for TCP downlink). 


2.3 802.11b simulator 


While our experiments clearly demonstrate real-world 
performance problems, it is often difficult to determine 
to what extent the degradation measured is due to the 
802.11 protocol rather than interference, fading, hidden 
terminals, or other environmental factors. In order to 
cleanly separate these factors, we have implemented an 
802.11 protocol simulator that allows us to evaluate how 
aspects of the standard distributed coordination function 
(DCF) algorithm impact performance, in particular resid- 
ual capacity. We specifically omit the simulation of RF 
properties, rate adaptation, background broadcast traffic 
(e.g., DHCP and ARP), and hardware imperfections, in 
order to show that the DCF algorithm by itself explains 
our experimental observations of residual capacity. We 
focus on the percentage of time a client uses the medium, 
since it not only directly reflects bulk UDP throughput, 
but also indirectly reflects loss rate: in a DCF-based 
model losses are caused by colliding packets, which in 
turn occupy airtime. 


2.3.1 Configuration and validation 

The simulator contains objects representing the AP and 
wired and wireless stations that send UDP traffic (bulk 
traffic or based on the traffic characteristics of a VoIP 
codec). Wired stations are modeled as directly connected 
to the AP. The wireless stations and AP contend for ac- 
cess using the standard 802.11 DCF algorithm. We pa- 
rameterize the simulator to mimic the behavior of our 
testbed hardware (particular settings are detailed later in 
Table 1) and use a bit rate of 11 Mbps. We configure an 
AP queue length of 500 and station queue lengths of 10, 
but note that our simulation results are not sensitive to 
the choice of queue-length parameters. 

We simulate the 802.11b experiment described earlier 
for UDP and find that the results are very similar in air- 
time. For example, simulated throughput degradation is 
within 10% of the experimental results. The largest dif- 
ference between the simulated and experimental results 
is seen in the uplink VoIP loss rate, which is 0.8-2.3% for 
ten VoIP stations versus less than 0.02% on the testbed. 


2.3.2 DCF’s share of VoIP impact 

Having established that our simulation exhibits behavior 
similar to that of the testbed in 802.11b, and that a DCF-based 
model is sufficient to explain the degradation of residual 
capacity in our testbed under VoIP, we now analyze the 
simulation data to determine which aspect of DCF causes 
the observed behavior. Figure 4 shows the simulated air- 
time used by each of the following components: non- 
colliding bulk traffic (bulk), non-colliding VoIP uplink 
and downlink traffic (voipup, voipdown), colliding pack- 
ets (collisions), and times when all stations are backing 
off or sensing the medium (backoff). 

Figure 4: Simulated airtime versus the number of VoIP 
streams, in the presence of 802.11b UDP uplink traffic. 

VoIP takes up a large fraction of the airtime, e.g., 40% 
for ten sessions, exceeding the airtime used by bulk traf- 
fic. Most of the VoIP airtime (35%) consists of fram- 
ing overhead. Additionally, 33% of total airtime is over- 
head due to contention (20% backoff plus 13% wasted on 
collisions). The techniques presented in the next section 
are capable of reducing a significant portion of overhead, 
specifically the framing overhead of downlink VoIP traf- 
fic (11%) and the collision time (13%). Based upon these 
numbers alone there is potential to almost double the 
residual channel capacity. 
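
The "almost double" estimate follows from the airtime shares quoted above for ten VoIP sessions. The sketch below assumes that the listed components account for all of the airtime, so that bulk traffic holds whatever VoIP, backoff, and collisions do not use; that remainder (27%) is inferred here and is not stated in the text.

```python
# Airtime shares for ten VoIP sessions, in percent, as quoted in the text.
VOIP_PCT = 40                   # all VoIP traffic, uplink and downlink
BACKOFF_PCT = 20                # stations backing off or sensing
COLLISION_PCT = 13              # airtime wasted on collisions
DOWNLINK_VOIP_FRAMING_PCT = 11  # framing overhead of downlink VoIP


def bulk_share_pct():
    """Assumption: bulk traffic uses whatever airtime remains."""
    return 100 - VOIP_PCT - BACKOFF_PCT - COLLISION_PCT


def potential_capacity_gain():
    """Ratio of residual capacity after and before recovering the
    downlink VoIP framing overhead and the collision time."""
    recovered = DOWNLINK_VOIP_FRAMING_PCT + COLLISION_PCT
    return (bulk_share_pct() + recovered) / bulk_share_pct()


if __name__ == "__main__":
    print(f"bulk share {bulk_share_pct()}%, "
          f"potential gain {potential_capacity_gain():.2f}x")
```

Under that assumption, recovering 24 percentage points of airtime on top of a 27% bulk share yields a gain of roughly 1.9 times.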


3 Softspeak 


Softspeak targets the key challenges of excessive con- 
tention and framing to build a software-only solution 
that can be deployed on existing commodity hardware. 
The main idea is to aggregate voice traffic by combin- 
ing many small packets into larger ones, thereby reduc- 
ing per packet overhead. Others have observed that all 
downlink packets must pass through the AP; hence, the 
opportunity to aggregate exists either at the AP itself 
or just before the packets are sent to the AP [23, 24]. 
However, physically aggregating uplink VoIP packets is 
challenging since there are multiple, independent VoIP 
senders. Instead, we propose a time-division multiple 
access (TDMA) scheme that approximates uplink aggre- 
gation to the extent that it provides a similar reduction 
in contention overhead. Our uplink TDMA scheme can 
function independently of the downlink scheme and re- 
quires only client-side modifications. Downlink aggre- 
gation, on the other hand, also requires either modifying 
the AP, or, more realistically, adding a separate “VoIP 
aggregator” device upstream from the AP. Both mecha- 
nisms conform to the existing 802.11 specification and 
coexist with VoIP stations that do not use Softspeak. 


3.1 Uplink TDMA 


Our uplink approach reduces the amount of contention 
created by VoIP clients. Specifically, we alter the con- 
tention behavior of the VoIP clients to no longer con- 
tend with non-VoIP clients, and then devise a distributed 
mechanism to schedule the VoIP clients in a TDMA fash- 
ion so that they no longer contend with each other either. 

We remove the VoIP clients from the standard con- 
tention process by modifying their backoff behavior. In- 
stead of sensing the medium for the 802.11-mandated 
DCF inter-frame spacing (DIFS) followed by a random 
backoff before sending, a Softspeak VoIP client senses 
for a shorter period of time and does not perform back- 
off, thus preventing collisions with non-VoIP traffic. (In 
the absence of hidden terminals, collisions with ACKs 
are prevented by 802.11’s NAV mechanism.) This be- 
havior effectively prioritizes uplink VoIP traffic and im- 
proves call quality. (A similar mechanism is employed 
by a commercial product, SVP [2].) By itself, however, 
this alteration inhibits DCF’s ability to prevent collisions 
among the VoIP stations. In fact, when we simulate 
only two VoIP stations that sense for a short inter-frame 
spacing (SIFS) without backoff in combination with bulk 
traffic that uses standard contention, we find that neither 
VoIP station is able to sustain a viable VoIP session. 

To prevent VoIP stations from colliding with each 
other, we introduce coarse-grained time slots and con- 
struct a TDMA schedule for the VoIP clients. When used 
in combination with downlink aggregation, the downlink 
aggregator node can assign TDMA slots as well as per- 
form admission control, since it has knowledge of all the 
clients using our scheme. In the absence of a central- 
ized scheduler, we devise a distributed mechanism (Sec- 
tion 3.1.1) that leverages management frames within the 
802.11 protocol to allocate slots. 


3.1.1 Slot allocation and admission control 


In an ideal deployment, the network operator will have 
installed a Softspeak VoIP downlink aggregator that can 
assign slots for uplink TDMA. If all available slots are 
in use it can deny access to a new Softspeak client, in 
which case that client resorts to normal 802.11 DCF. In 
some scenarios, however, it may be easier for individ- 
ual clients to install Softspeak software than to convince 
network operators to install new hardware. Moreover, 
uplink TDMA is useful by itself, i.e., without downlink 
aggregation, since it reduces contention by uplink VoIP 
stations. Hence, if clients are unable to locate a VoIP ag- 
gregator (Section 3.2 describes the registration process), 
they proceed with a distributed allocation process. 
Independent of how TDMA slots are allocated to 
clients, VoIP stations need to be synchronized in order 
to correctly use their assigned slots. Each client uses 
the periodic beacon frame broadcast by an 802.11 AP to 
synchronize with other VoIP clients. Beacons are sent at 
fixed intervals (usually 100 ms), and, since they are sent 
by the AP at a low bit rate, are typically received by all 
clients. It is important to note that a VoIP client may also 
hear beacons from an AP other than the one to which 
it is associated. To use beacon-based synchronization, 
VoIP clients need two important pieces of information: 
a) The AP to whose beacons other nearby VoIP clients 
are synchronizing, and b) which TDMA slots they are 
using. The slot allocation process provides both pieces 
of information. In the case of distributed slot allocation 
each VoIP client encodes the information by temporarily 
spoofing its MAC address (6 octets) as follows: 


• The first three octets (known as the OUI) are taken 
from a reserved OUI address space to ensure the 
resulting address is valid and unique. 


• The next two octets are the same as the last two 
octets of the BSSID of the AP to whose beacons 
the VoIP station is synchronizing. 


• The last octet is used to denote the particular real 
time slot the VoIP station is using or wants to use. 
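
The address layout above can be illustrated with a short encoder and decoder. This is a minimal sketch rather than the authors' implementation: the text does not name the reserved OUI, so `SOFTSPEAK_OUI` below is a hypothetical placeholder, and the function names are likewise illustrative.

```python
# Hypothetical placeholder: the paper does not state which OUI is reserved.
SOFTSPEAK_OUI = bytes.fromhex("02aabb")


def encode_slot_mac(bssid, slot):
    """Build the spoofed MAC: OUI, last two BSSID octets, slot number."""
    bssid_bytes = bytes.fromhex(bssid.replace(":", ""))
    if len(bssid_bytes) != 6:
        raise ValueError("BSSID must have six octets")
    if not 0 <= slot <= 255:
        raise ValueError("slot must fit in one octet")
    mac = SOFTSPEAK_OUI + bssid_bytes[-2:] + bytes([slot])
    return ":".join(f"{octet:02x}" for octet in mac)


def decode_slot_mac(mac):
    """Return (bssid_suffix_hex, slot), or None for an ordinary address."""
    raw = bytes.fromhex(mac.replace(":", ""))
    if len(raw) != 6 or raw[:3] != SOFTSPEAK_OUI:
        return None
    return raw[3:5].hex(), raw[5]


if __name__ == "__main__":
    spoofed = encode_slot_mac("00:11:22:33:44:55", 7)
    print(spoofed, decode_slot_mac(spoofed))
```

Because only the last two BSSID octets are carried, a listener would have to match the suffix against the BSSIDs of the beacons it hears; how ambiguous suffixes are resolved is not covered by this sketch.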


The main concern when coordinating clients is that 
there is no guarantee they can hear each other’s trans- 
missions. Hence, Softspeak clients coerce the AP into 
generating specially crafted packets that the other clients 
can hear. VoIP stations using uplink TDMA periodically 
(e.g., once a second) send directed Probe-Requests on the 
channel and to the AP to which they are currently associ- 
ated using the modified MAC address. The destination 
(unmodified) AP will respond with a Probe-Response 
packet whose destination is the VoIP station’s modified 
MAC address, which is heard by all associated clients. 

A new VoIP station that wants to use uplink TDMA 
first enters promiscuous mode for a few seconds to sense 
the channel to check if there are any special Probe- 
Response packets (easily identifiable by the first three 
octets of the destination MAC address), thus determin- 
ing which AP’s beacons are being used for synchroniza- 
tion and which slots are in use. If the VoIP client detects 
any such Probe-Responses, it extracts the encoded AP 
and uses that for TDMA synchronization. Otherwise it 
synchronizes using the AP with which it is associated. 
In either case, the VoIP client picks an unused slot and 
starts to periodically broadcast a Probe-Request with its 
source MAC address denoting its slot and the AP it is us- 
ing for synchronization. As before, the AP sends Probe- 
Responses which can be heard by new VoIP clients want- 
ing to join. Finally, when a VoIP station finishes its ses- 
sion it stops sending Probe-Requests. 
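
The join procedure described above can be sketched as a small selection function. The sketch assumes that overheard Probe-Responses have already been decoded into (BSSID suffix, slot) pairs; adopting the first encoded AP seen, choosing the lowest-numbered free slot, and falling back to standard DCF when no slot is free are all assumptions, since the text only says that the client "picks an unused slot".

```python
def choose_sync_and_slot(observed, own_bssid_suffix, num_slots=10):
    """Pick the AP to synchronize to and a free TDMA slot.

    observed: list of (bssid_suffix, slot) pairs decoded from special
    Probe-Responses overheard while in promiscuous mode.
    Returns (sync_suffix, slot); slot is None when every slot is taken.
    """
    if observed:
        # Assumption: adopt the AP encoded in the first response seen.
        sync_suffix = observed[0][0]
    else:
        # No Softspeak clients heard: use the AP we are associated with.
        sync_suffix = own_bssid_suffix

    used = {slot for suffix, slot in observed if suffix == sync_suffix}
    for slot in range(num_slots):
        if slot not in used:
            return sync_suffix, slot
    # Assumption: with no free slot the client would use standard DCF.
    return sync_suffix, None


if __name__ == "__main__":
    print(choose_sync_and_slot([("aabb", 0), ("aabb", 2)], "4455"))
```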

Our slot assignment scheme seamlessly supports dy- 
namic node arrivals and departures. Moreover, this 
scheme works even when nearby clients are associated to 
different APs, since a client may synchronize with an AP 
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Figure 5: Time series of transmission times by a single 
station, no synchronization. 


other than the one it is associated to. Finally, our scheme 
works if APs use various 802.11 security features since 
Probe-Requests and Probe-Responses are always sent un- 
encrypted. We have deployed our scheme with an AP 
that employs MAC-address-based access control, WPA2 
or WEP encryption, and disabled SSID broadcasting. 

A drawback of the distributed allocation scheme as 
currently described is that it is unable to detect multi- 
ple clients attempting to allocate the same slot simulta- 
neously. We observe that this problem can be solved (or 
made unlikely to occur) by adding some bits of random- 
ness to the spoofed MAC address, allowing the clients to 
arbitrate among conflicting slot allocations. For exam- 
ple, the scheme may be extended by having VoIP clients 
announce the BSSID and the slot number in separate 
Probes, thus allowing room for some bytes to be set ran- 
domly by each client. 


3.1.2 Synchronizing TDMA slots 


To implement uplink TDMA, we modify the Ralink 
RT2560F wireless card protocol stack in Linux 2.6.21 
(without modifying the WiFi hardware or firmware). Ide- 
ally, once slots are allocated, each VoIP station contends 
for the channel in its assigned slot and refrains from con- 
tending outside its slot. By default, the Linux 2.6 ker- 
nel timer interrupt is programmed to fire every millisec- 
ond; we show later that this also happens to be close to 
the optimal granularity for VoIP slotting in 802.11b. Us- 
ing one-millisecond slots, a TDMA scheme can support 
ten simultaneous VoIP stations using a codec with a 10- 
ms inter-packet arrival rate, or 20 stations using a 20- 
ms codec. Since 802.11a/g frames for these codecs take 
less airtime, Softspeak could use smaller slots, allowing 
a larger number of VoIP stations to be admitted; we have 
not yet implemented sub-millisecond slotting. 

A straightforward implementation of one-millisecond 
slotting is to suspend and resume transmission from 
within Linux’s timer interrupt handler in accordance with 
a station’s assigned slot. However, the naive approach 
faces two problems: clock skew and timer inaccuracy. 
Figure 5 illustrates both. In this experiment, a single sta- 
tion uses iperf to emulate a G.729 VoIP codec with 
a 10-ms inter-packet arrival rate. We manually assign 
the station a static TDMA slot; there is little to no back- 
ground traffic on the same AP during the experiment. 

In the figure, the x axis plots time in seconds, and the 
y axis shows the start time of each transmission modulo 
10,000 μs (10 ms). The figure shows the effect of the 
timer interrupt firing faster than 1,000 times per second 
as well as iperf sending slightly slower than the con- 
figured rate of 100 packets per second. If the timer inter- 
rupt and iperf operated at their correct rate, we would 
expect to see a single horizontal band corresponding to 
the station’s assigned slot. Instead, iperf schedules 
packets at a rate slower than the timer interrupt, and as 
a result iperf and the implemented TDMA slot drift with 
respect to each other. When iperf happens to send in- 
side the slot, a short almost horizontal line appears start- 
ing at the bottom of the slot (the slight upward slope of 
this line is the clock skew). Once transmissions reach the 
top of the slot, packets are buffered until the start of the 
next slot, causing the downward sloping lines. The slope 
is caused by the timer interrupt firing too fast. 

Different stations may exhibit different degrees of 
skew, possibly even varying across time. We address 
this issue by effectively slaving each station’s clock to an 
AP. Specifically, we reset the timer every time a station 
hears the periodic beacon frame from the AP that was as- 
signed during the slot allocation process. On the Soekris 
net4801 in our testbed, Linux uses the programmable in- 
terval timer (PIT) as its timer interrupt source. Therefore, 
we modify the driver to reset the PIT every time it hears 
a beacon, which we have measured to be roughly once 
every 102-103 ms for the APs in our network. 
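
The effect of beacon-based resynchronization can be sketched as a mapping from time since the last beacon to a slot index. This is an illustration only: it assumes that slot numbering restarts at zero whenever a beacon is heard and that slots repeat with the codec period, neither of which the text spells out, and the function names are hypothetical.

```python
SLOT_US = 1000  # one-millisecond TDMA slots


def current_slot(now_us, last_beacon_us, num_slots=10):
    """Slot index at time now_us, counted from the last beacon heard.

    Assumption: the slot counter restarts at zero on every beacon, which
    mirrors resetting the timer in the beacon handler.
    """
    elapsed_us = now_us - last_beacon_us
    if elapsed_us < 0:
        raise ValueError("now_us precedes the last beacon")
    return (elapsed_us // SLOT_US) % num_slots


def in_assigned_slot(now_us, last_beacon_us, my_slot, num_slots=10):
    """True while the station's own TDMA slot is active."""
    return current_slot(now_us, last_beacon_us, num_slots) == my_slot


if __name__ == "__main__":
    # 12.3 ms after a beacon, a 10-slot schedule is in slot 2.
    print(current_slot(114_700, 102_400))
```

Because every station derives the slot index from the same beacon, clock skew accumulates only over one beacon interval instead of growing without bound.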

Manipulating the PIT timer in this way may conceiv- 
ably cause unintended timing artifacts in the station’s op- 
eration. Therefore, we have developed an alternative im- 
plementation that uses Linux’s high-resolution timers to 
schedule the VoIP slots and have observed a similar de- 
gree of synchronization. However, the results in this pa- 
per are based on manipulating the PIT timer. 


3.1.3 Controlling transmission timing 


An obvious complication with our scheme is that when 
a TDMA slot starts, a station other than the station that 
has been assigned the slot may already be transmitting a 
frame. At 11 Mbps a maximum-sized IP packet (1500
bytes) together with its ACK will take 1376 µs, potentially
delaying the station by that time from the start of its slot
into the next slot.² In addition, the VoIP station may
repeatedly fail to capture the channel even while actively
contending. We address this challenge by letting the
WiFi card driver adjust the way a VoIP station contends
for the channel during its assigned slot, a mechanism we
term dynamic IFS (dynamic inter-frame spacing).

In standard DCF, stations contend using an inter-frame
spacing of SIFS + (2 · cwslot) followed by a random
backoff. (By cwslot we denote an 802.11 contention-window
slot, 20 µs in 802.11b, not Softspeak’s 1-ms
TDMA slot.) We use the two 20-µs cwslot intervals
starting at SIFS and (SIFS + cwslot), respectively, to
(a) prioritize the VoIP traffic over non-VoIP traffic and
(b) prioritize among different VoIP stations to avoid
collisions. Accordingly, we let each station contend as
follows. Figure 6 considers a station sta_i, which is assigned
TDMA slot i. During the station’s assigned TDMA
slot it contends with (SIFS + cwslot) (and no backoff).
In slot i + 1, it contends with SIFS (and no backoff).
In any other slot it contends as specified by DCF
(SIFS + (2 · cwslot) + backoff).
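
The contention rule above can be sketched as a small lookup. This is an illustrative sketch rather than the driver code: the function name is hypothetical, the SIFS value is the standard 10 µs of 802.11b, and wrapping slot i + 1 around at the end of the codec interval is an assumption.

```python
SIFS_US = 10         # 802.11b SIFS
CWSLOT_US = 20       # 802.11b contention-window slot (from the text)
NUM_TDMA_SLOTS = 10  # assumption: ten 1-ms TDMA slots per 10-ms codec interval


def contention_params(current_slot, assigned_slot, num_slots=NUM_TDMA_SLOTS):
    """Return (ifs_us, use_backoff) for a VoIP station in current_slot."""
    if current_slot == assigned_slot:
        # Own slot: SIFS + cwslot, no backoff.
        return SIFS_US + CWSLOT_US, False
    if current_slot == (assigned_slot + 1) % num_slots:
        # Delayed into the following slot: SIFS only, no backoff,
        # so the delayed station beats that slot's owner.
        return SIFS_US, False
    # Any other slot: plain DCF, SIFS + 2 * cwslot plus random backoff.
    return SIFS_US + 2 * CWSLOT_US, True


if __name__ == "__main__":
    for slot in (3, 4, 5):
        print("slot", slot, "->", contention_params(slot, assigned_slot=3))
```

A station delayed into slot i + 1 waits only SIFS, which is shorter than the SIFS + cwslot used by the owner of slot i + 1, and both are shorter than the spacing used by stations running plain DCF.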

Now let us consider the scenario illustrated in Figure 7,
in which a station sta_i in TDMA slot i is delayed
into the next TDMA slot (i + 1) by an ongoing
transmission, and assume for the moment that sta_i’s packet
was ready at the start of slot i. After the transmission
has ended, stations sta_i and sta_{i+1} contend for the
channel. However, due to the assigned contention
parameters, sta_i is guaranteed to win over station sta_{i+1}.
Furthermore, after sta_i has finished transmitting and
received its ACK (after 430 µs for a large-payload G.711
codec), there is still at least (2 ms − 1376 µs − 430 µs
= 194 µs) for sta_{i+1} to commence its transmission and
therefore not contend in TDMA slot (i + 2). It can be
shown that in the absence of retransmissions, as long as
(a) the duration of a VoIP frame is less than one TDMA
slot and (b) the duration of a bulk frame is less than two
TDMA slots, station i will never contend in slot (i + 2).
Even if, due to, e.g., 802.11 retransmissions or imperfect
control of timing by Softspeak, a station ends up
contending in a TDMA slot other than i or (i + 1), it will do
so using conventional DCF contention parameters and do
no worse than without our improvements.
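
The timing budget in this argument can be checked with a few lines of arithmetic. The sketch below uses only the durations quoted in the text; the function name is hypothetical, and the worst case is assumed to be a bulk frame that starts exactly at the beginning of slot i.

```python
TDMA_SLOT_US = 1000   # Softspeak TDMA slot (1 ms)
BULK_FRAME_US = 1376  # maximum-sized frame plus ACK at 11 Mbps (from the text)
VOIP_FRAME_US = 430   # large-payload G.711 frame plus ACK (from the text)


def remaining_time_us(bulk_us=BULK_FRAME_US, voip_us=VOIP_FRAME_US,
                      slot_us=TDMA_SLOT_US):
    """Time left in slots i and i+1 for sta_{i+1} to start transmitting.

    Assumes the bulk frame begins at the start of slot i and that sta_i
    transmits immediately afterwards.
    """
    return 2 * slot_us - bulk_us - voip_us


if __name__ == "__main__":
    print(remaining_time_us(), "us remain for sta_{i+1}")
```

A positive result means that sta_{i+1} can still begin its transmission before slot i + 2 starts.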

Figure 8 plots the transmission start times of ten VoIP 
stations, each assigned a separate TDMA slot, when 
competing against background traffic. In particular, a 
bulk UDP sender generates background traffic in the 
downlink direction to a separate wireless station. Us- 
ing dynamic IFS, the slotting is clearly defined: while 
the bands are longer than 1 ms due to delays caused by 
ongoing background traffic transmissions (as explained 
above), the majority of transmissions do not commence 
more than one slot away. 

The first slot (assigned to the VoIP station plotted in 
the first column of Figure 8) commences roughly 500 µs
after the beacon time. This offset is caused by inevitable
delays between the time that the beacon is generated by
the AP and when it is received and processed by a station,
and also between the time the station driver generates a
packet for a particular slot and the time that it is
transmitted. In particular, 400 µs of this time is accounted for
by beacon transmission time, the remainder consisting 
of processing delays in the station. While some of these 
processing delays may vary across different stations, as 
subsequent figures show, the delay is consistent enough 
across multiple stations with the same hardware config- 
uration that a station’s synchronization can be tuned for 
that hardware. 


3.2. Downlink aggregation 


Downlink aggregation introduces an aggregator component
that is placed at or before the WiFi AP (uplink from
the AP). The aggregator is on-path and transparently
forwards all traffic to and from the AP; non-VoIP traffic is
forwarded without modification. The aggregator buffers
VoIP frames destined for wireless stations and releases
a frame encapsulating the buffered frames at a regular
interval (every M ms, where M is the minimum
packetization interval of the VoIP codecs in use). By combining
all the VoIP sessions into one packet per codec interval,
downlink aggregation can virtually eliminate the
marginal header and contention overhead of additional
VoIP clients. There is a downside, however: when the
aggregator buffers a packet, it adds a delay of
M/2 ms in expectation, e.g., 5 ms given a 10-ms codec.

Figure 8: TDMA slotting by ten VoIP stations using
dynamic IFS in the presence of UDP downlink background
traffic. Each column represents a distinct VoIP station.
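
The M/2 expected buffering delay of the aggregator can be illustrated numerically. This is a sketch under the assumption that packet arrivals are spread evenly over the release interval; the function name and the sample arrival times are hypothetical.

```python
def mean_buffering_delay_us(interval_us, arrivals_us):
    """Average wait until the next release instant.

    Releases are assumed to happen at integer multiples of interval_us.
    """
    waits = [(-t) % interval_us for t in arrivals_us]
    return sum(waits) / len(waits)


if __name__ == "__main__":
    interval = 10_000  # 10-ms codec interval, in microseconds
    # Arrivals every 10 us, centered within each 10-us bin of one interval.
    arrivals = range(5, interval, 10)
    print(mean_buffering_delay_us(interval, arrivals), "us")
```

For a 10-ms interval the average wait in this model is 5 ms, matching the figure quoted for a 10-ms codec.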

When a new Softspeak VoIP session starts up (or when 
the station roams to a different AP) it registers with 
the aggregator node, which we implement on a sepa- 
rate Linux machine. When the aggregator receives a 
downlink packet addressed to a registered VoIP client, it 
buffers the packet and combines it with all other buffered 
packets into a single encapsulated packet that it sends 
out at fixed intervals (e.g., 10 ms for G.729). The ag- 
gregator node uses the IP header information from the 
most recently heard uplink packet (say from station S1)
to construct a new frame. Addressing the packet to S1
increases the likelihood that the packet will be
acknowledged by a currently active VoIP client. We define an
aggregation header that stores the set of destinations and 
original IP packet lengths for each station. The aggrega- 
tion header is prepended to the UDP header and packet 
payload for S1, and then the respective IP and UDP 
headers and payloads for the remaining buffered VoIP 
packets are appended. 
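
The encapsulation described above can be sketched as follows. The wire format here is purely hypothetical (a one-byte count followed by an IPv4 destination and a two-byte length per packet); the text does not specify the actual header layout, and for simplicity every buffered packet is treated as an opaque byte string rather than distinguishing S1's UDP header from the other stations' IP and UDP headers.

```python
import socket
import struct

# Hypothetical header entry: 4-byte IPv4 destination, 2-byte packet length.
ENTRY = struct.Struct("!4sH")


def encapsulate(packets):
    """Build an aggregated body from a list of (dest_ip, packet_bytes)."""
    header = struct.pack("!B", len(packets))
    for dest, data in packets:
        header += ENTRY.pack(socket.inet_aton(dest), len(data))
    return header + b"".join(data for _, data in packets)


def extract(body, my_ip):
    """Return the portion addressed to my_ip, or None if there is none."""
    (count,) = struct.unpack_from("!B", body, 0)
    offset = 1 + count * ENTRY.size  # first payload byte after the header
    me = socket.inet_aton(my_ip)
    for i in range(count):
        dest, length = ENTRY.unpack_from(body, 1 + i * ENTRY.size)
        if dest == me:
            return body[offset:offset + length]
        offset += length
    return None


if __name__ == "__main__":
    frame = encapsulate([("10.0.0.1", b"voice-1"), ("10.0.0.2", b"voice-2")])
    print(extract(frame, "10.0.0.2"))
```

Each overhearing station scans the header for its own address and skips over the lengths of the packets that precede its own, which is why the header records the original packet length per destination.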

In contrast to previous proposals [23], we address the 
aggregated frame to only one of the VoIP stations; we 
configure the WiFi interface of each of the VoIP sta- 
tions to be in promiscuous mode to allow them to re- 
ceive the aggregated packets regardless of the destina- 
tion. The client passes aggregated packets to the Soft- 
speak module that de-encapsulates the packet, extracts 
the portion meant for the current station, and passes it up 
the networking stack. Because the aggregated packet is 
addressed to only one station, there will be at most one 
MAC-layer acknowledgment. Wang et al., on the other 
hand, propose the use of multicast in order to eliminate 
the MAC ACK frame. We preserve the ACK frame for 
two pragmatic reasons. First, in our experience, while
obviously unable to eliminate all loss, the single ACK
frame is a cost-effective mechanism to protect the
aggregated packet against many collisions. Second, and
perhaps more importantly, commodity access points
typically transmit multicast frames only at a multiple of the
beacon interval to inter-operate with clients in power-save
mode, introducing intolerable delay.

Table 1: 802.11b contention parameters measured for
our wireless hardware.


4 Evaluation 


We now evaluate the effect of downlink aggregation 
and uplink TDMA, both independently and in concert. 
In particular, we show that (a) our schemes signifi- 
cantly increase the available channel capacity while usu- 
ally maintaining—and sometimes improving—VoIP call 
quality, and (b) our implementation of Softspeak is close 
to optimal in terms of throughput improvement. 


4.1 Experimental testbed 


The wireless infrastructure in our building is a managed 
802.11b/g deployment of enterprise-class Avaya AP-8 
access points. There are multiple APs per floor which 
are configured to orthogonal channels to increase spatial 
diversity. We configure eleven Soekris net4801 boxes 
to act as VoIP stations. Each has two mini-PCI wire- 
less cards: an Atheros AR5212 chipset-based card and 
a Ralink RT2560F-based interface. The net4801 is a
single-board computer with a 266-MHz CPU
running the Linux operating system. To simplify our
experiments, we emulate VoIP traffic using iperf. We
use iperf to generate UDP traffic that mimics a com- 
monly used VoIP codec, G.729, at 10-ms inter-packet in- 
tervals. RTS/CTS is disabled on all Soekris boxes and 
APs. All experiments are conducted late at night to min- 
imize background wireless activity. 

We employ ten commodity PCs connected over wired 
gigabit Ethernet as endpoints for the (emulated) VoIP 
traffic generated by the Soekris boxes. Essentially, each 
PC-Soekris pair serves as a distinct bi-directional VoIP 
call. One additional PC-Soekris pair conducts a bulk 
transfer (TCP or UDP) to measure the residual capac- 
ity of the wireless channel in the presence of the VoIP 
traffic. The TCP receive-window size is configured to be 
large enough that our TCP transfers are never receive- 
window limited. Unless otherwise noted, bulk transfer 
is conducted through the Atheros card, while the Ralink 
interfaces send and receive VoIP traffic. 


Table 1 reports the default contention parameters for 
the various devices in our testbed as measured by the Jig- 
saw wireless monitoring infrastructure [5]. We note that 
neither the Atheros card nor the Avaya AP appears to 
double its contention window size on retries, in contrast 
with the default behavior specified by 802.11. 


4.2 Results for 802.11b 


Figures 9 and 10 compare bulk throughput and VoIP 
call quality across all combinations of applying uplink 
TDMA and/or downlink aggregation in 802.11b, for TCP 
uplink and downlink. The results for UDP bulk uplink 
(not shown) are similar to those of TCP uplink. We dis- 
cuss the case of UDP bulk downlink in Section 4.3. The 
most important conclusions are that (a) applying a com- 
bination of uplink TDMA and downlink aggregation im- 
proves residual bulk throughput, in some cases drasti- 
cally, (b) with one exception, call quality is preserved or 
greatly improved, (c) applying only one of uplink TDMA 
or downlink aggregation does not achieve these results 
across all three cases of bulk traffic load. 

We summarize the benefits of Softspeak (combined 
uplink TDMA and downlink aggregation) over 802.11, 
for the case of ten VoIP sessions, as follows: 


TCP uplink and UDP uplink: Capacity increases by 
around 50% (Figure 9(a)). Downlink VoIP im- 
proves from being completely unusable for VoIP to 
being usable (Figure 9(b)). The bulk of this
improvement comes from a reduction in downlink loss
rate (from 55% to 4.8%) by downlink aggregation. 
However, uplink TDMA contributes significantly 
by further reducing the downlink loss rate (to 1.8%), 
resulting in a substantial increase in MOS. For up- 
link VoIP (Figure 9(c)) most of the MOS improve- 
ment comes from downlink aggregation, which re- 
duces the RTT from over 400 ms to below 25 ms by 
reducing queuing at the AP.³


TCP downlink: Capacity multiplies 4.8 times (380% 
increase) from 92 KB/s to 445 KB/s (Figure 10(a)). 
Unfortunately, VoIP downlink MOS degrades 
somewhat (Figure 10(b)). On closer examination, 
we find that downlink MOS suffers from an in- 
creased loss rate from downlink aggregated packets: 
since Softspeak’s downlink aggregation scheme re- 
ceives link-layer acknowledgments from only one 
VoIP client, only frame losses experienced by that 
client result in retransmission. Frame corruption ex- 
perienced by other clients remains unnoticed. We 
address this issue when we present our results for 
802.11g (Section 4.4) where higher frame rates may
further increase the probability of frame corruption. 


While these results show that Softspeak improves the
efficiency of 802.11b networks in the presence of VoIP
in terms of residual TCP capacity (while mostly preserving
VoIP call quality), an important question is whether
further improvements to our implementation could be
made. For example, it might be the case that our
implementation of uplink TDMA lacks sufficient control
of VoIP packet scheduling, causing collisions. An optimal
implementation (e.g., one that is implemented in
the 802.11 hardware or firmware) might do a better job
at controlling the emission of frames according to the
TDMA schedule.

Figure 10: Impact of a varying number of VoIP stations in combination with TCP downlink traffic (802.11b).


To investigate to what extent further improvements 
may be made to our implementation (but while remain- 
ing faithful to Softspeak), we compare our results with 
those based on an emulation of an optimal implemen- 
tation. We emulate downlink aggregation by replacing 
the individual VoIP senders that generate downlink VoIP 
traffic by a single sender that generates packets of the 
size produced by the downlink aggregator, eliminating 
any jitter and loss potentially caused by the downlink 
aggregator. Furthermore, downlink packets are sent to, 
and their loss rate measured at, a single VoIP station, 
eliminating any losses due to imperfect overhearing. We 
emulate uplink TDMA by replacing the VoIP stations
by a single VoIP station that sends packets on behalf
of all VoIP stations; in other words, it sends packets at
ten times the codec rate. The single VoIP station
naturally serializes the transmission of uplink VoIP
packets, thereby eliminating any collision among VoIP
stations. To minimize the probability of colliding with other
traffic, it uses SIFS without backoff. In Figures 9 and 
10 the results of the emulation are plotted as an ‘opti- 
mal’ point for ten VoIP clients. In terms of capacity and 
uplink MOS, Softspeak achieves close to what is opti- 
mally achievable. For downlink MOS, consistent with 
our earlier observation, Softspeak performs worse than 
optimal due to imperfect overhearing. However, note that 
in Figure 10(b) even optimal Softspeak’s downlink MOS 
is worse than that of ‘no softspeak’. This may be ex- 
pected, given that (optimal) Softspeak enables TCP traf- 
fic to considerably increase network resource usage. For 
example, we measure a 25% increase in RTT (as well 
as an increased RTT variance) due to a higher AP queue 
occupation, which in turn explains the higher loss rate of 
downlink VoIP traffic. 


4.3 UDP and 802.11e


While Softspeak can improve the capacity available for
bulk UDP downlink traffic in 802.11b networks (Table 2),
it cannot simultaneously reduce the high VoIP
downlink loss rate that results from competing with a
CBR UDP flow. These losses are caused by the AP
queue filling with bulk UDP downlink traffic, combined
with the fact that UDP does not respond to increasing
loss and delay. Similarly, when replacing a single bulk
TCP stream by a sufficiently large number of bulk TCP
streams, the AP queue fills up with TCP packets, causing
large delay. These losses and delays can only be
ameliorated by adding prioritization at the AP: (aggregated)
VoIP packets would therefore not be dropped regardless
of the amount of non-VoIP traffic buffered at the AP.
Luckily, prioritization is part of the 802.11e standard.

Metric                    | No Spk | Spk | Spk+Prio
Downlink bulk tput (KB/s) | 375    | 005 | 551
Downlink VoIP loss rate   | 67%    | 61% | <01%
Uplink VoIP loss rate     | 0.82%

Table 2: The effectiveness of combining Softspeak
(Spk) with prioritization (Prio) in the presence of ten
VoIP stations and downlink bulk UDP traffic (802.11b,
simulated). (UDP throughput without VoIP is 924 KB/s.)

Figure 11: Simulated Softspeak airtime usage versus
the number of active VoIP streams, in the presence of
802.11b UDP uplink bulk traffic (cf. Figure 4).


4.3.1 Prioritization 


Unfortunately, our testbed hardware cannot simultaneously
support 802.11e (supported only by the Atheros
chipset) and Softspeak (which is currently only
implemented for the Ralink interfaces). We therefore evaluate
Softspeak combined with 802.11e-like prioritization at
the AP using our simulator. Consistent with our results 
in Section 2.3.2, our simulator produces results similar to 
those measured experimentally for the case of UDP with- 
out prioritization for the combination of uplink TDMA 
and downlink aggregation, and we therefore believe that 
we can extrapolate to the case of AP prioritization. Ta- 
ble 2 shows that when we combine Softspeak with pri- 
oritization, we not only achieve a 47% improvement on 
downlink bulk UDP capacity, but also improve VoIP loss 
rate compared to the baseline. 
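
The kind of AP prioritization assumed in this section can be sketched as a two-class strict-priority queue. This is an illustration of the idea only, not the 802.11e mechanism and not the simulator's implementation; the class name, the unbounded VoIP buffer, and the tail-drop policy for bulk traffic are all assumptions.

```python
from collections import deque


class PriorityApQueue:
    """Two-class transmit queue in which VoIP is always served first."""

    def __init__(self, bulk_capacity):
        self.voip = deque()
        self.bulk = deque()
        self.bulk_capacity = bulk_capacity
        self.bulk_drops = 0

    def enqueue(self, packet, is_voip):
        """Queue a packet; return False if it was dropped."""
        if is_voip:
            # VoIP has its own buffer, so bulk traffic cannot displace it.
            self.voip.append(packet)
            return True
        if len(self.bulk) >= self.bulk_capacity:
            self.bulk_drops += 1  # tail-drop applies to bulk traffic only
            return False
        self.bulk.append(packet)
        return True

    def dequeue(self):
        """Return the next packet to transmit, or None if both queues are empty."""
        if self.voip:
            return self.voip.popleft()
        if self.bulk:
            return self.bulk.popleft()
        return None


if __name__ == "__main__":
    q = PriorityApQueue(bulk_capacity=2)
    for i in range(4):
        q.enqueue(f"bulk-{i}", is_voip=False)
    q.enqueue("voip-0", is_voip=True)
    print(q.dequeue(), q.bulk_drops)
```

With separate buffers, a full bulk queue causes drops and queuing delay for bulk traffic only, which is the property the text relies on when it says that VoIP packets are not dropped regardless of the amount of buffered non-VoIP traffic.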


4.3.2 Airtime utilization 

Implementing Softspeak in our simulator also allows us
to isolate the source of our performance improvement.
Figure 11 shows the simulated airtime plot corresponding
to Figure 4, but with uplink TDMA and downlink
aggregation enabled (and no prioritization). The figure
indicates that we have achieved our objective of converting
almost all time spent on downlink framing overhead
and on collision into bulk data capacity. Consistent with
the reduction in collision airtime we have also reduced
the collision rate, thereby improving loss rate, jitter, and
as a result, VoIP call quality and TCP throughput.

Softspeak enabled    | No measures | Fixed=11b  | Fixed=11b, optout
No                   | 3.7         |            |
Yes                  | 2.8         |            |
Yes, fixed Station 1 | 3.4         | 3.5        |
Yes, fixed Station 2 | 2.7 ± 1.0   | 3.2 ± 0.81 | 3.5 ± 0.31

Table 3: Downlink aggregation losses in the presence
of TCP downlink traffic (802.11g protected mode). The
values given are the average and standard deviation of MOS
across all downlink VoIP sessions.


4.4 Results for 802.11g 


For 802.11g we observe that Softspeak as currently
described makes significant improvements in capacity (24–32%
for ten VoIP stations), while maintaining or lowering
jitter and VoIP uplink loss to negligible levels. Recall
that when 802.11g runs in protected mode, TCP down- 
link capacity suffers tremendously in the presence of 
VoIP. Using Softspeak we are able to triple (increase by 
200%) the TCP downlink capacity for ten VoIP stations. 
However, Softspeak also introduces significant downlink 
VoIP loss, rising to 30% for some stations, where in some 
cases virtually none was experienced without enabling 
Softspeak. In the case of 802.11g protected mode this re- 
duces MOS from 3.7 to 2.8 on average and substantially 
increases the variance of MOS (Table 3, no measures). 
As noted in Section 4.2 for 802.11b, downlink ag- 
gregation is susceptible to frame corruption by any re- 
ceiver that is not the link-layer recipient of the aggre- 
gated packet, and the higher rates of 802.11g only in- 
crease the likelihood of frame corruption. Our solu- 
tion to this problem is three-fold. First, we observe that 
judiciously selecting a fixed station as the destination 
for aggregated packets may greatly alleviate loss: pick- 
ing a station that consistently experiences frame corrup- 
tion causes the AP to often retransmit aggregated frames 
thereby increasing each station’s probability of receiving 
a correct copy. For a particular choice of station
(Station 1 in Table 3), we observe that the average downlink
loss rate consistently reduces to below 2%, resulting in
an average MOS of 3.4. However, the MOS variance
remains high. Second, the selected station can be made to
associate with the AP at a lower rate, causing aggregated
packets to be transmitted at the lower rate and further
reducing frame corruption. To test this, we force Station 1
to associate in 802.11b mode (fixed=11b in Table 3) and
obtain a MOS of 3.5 as well as reduced variance. Note
that to avoid condemning one of the stations to low-rate
communication, a dummy 802.11 receiver can be added
to the downlink aggregator box (or placed separately)
and made to associate at the lower rate.

Figure 12: 10-ms codec VoIP in combination with TCP
traffic (802.11g protected mode, two stations opt out).

Figure 13: 20-ms codec VoIP in combination with TCP
traffic (802.11g protected mode, two stations opt out).
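
The first measure, addressing aggregated frames to a station that itself often fails to receive them, can be illustrated with a simple probability model. This is a sketch under strong assumptions that are not taken from the paper: each transmission is corrupted independently at each receiver, ACK frames are never lost, and the AP gives up after a fixed number of attempts (the value 7 below is illustrative).

```python
def bystander_loss(p_bystander, p_acker, max_tx):
    """Probability that an overhearing station misses every copy of a frame.

    p_bystander: per-transmission corruption probability at the overhearer.
    p_acker:     per-transmission corruption probability at the addressed
                 (acknowledging) station; each failure triggers a retry.
    max_tx:      maximum number of transmissions of one frame (assumed).
    """
    loss = 0.0
    for n in range(1, max_tx + 1):
        if n < max_tx:
            # Exactly n transmissions: n - 1 failures, then an ACK.
            p_n = p_acker ** (n - 1) * (1 - p_acker)
        else:
            # The final attempt is made whether or not it is acknowledged.
            p_n = p_acker ** (max_tx - 1)
        loss += p_n * p_bystander ** n
    return loss


if __name__ == "__main__":
    for p_acker in (0.0, 0.25, 0.5):
        print(p_acker, round(bystander_loss(0.3, p_acker, max_tx=7), 4))
```

In this model the loss seen by an overhearing station falls as the addressed station's own corruption rate rises, because more copies of each aggregated frame are sent. Real corruption is likely to be correlated across receivers, so the numbers are indicative only.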


Our third measure is to have any remaining bad re- 
ceivers opt out of the downlink portion of Softspeak (not 
evaluated for Station 1). By de-registering with the ag- 
gregator, these clients receive separate VoIP frames as 
in the non-aggregated case (while continuing to measure 
loss rate from received aggregated packets to help de- 
cide whether and when to re-register). Note that these 
stations can still participate in uplink TDMA. To
demonstrate that such a scheme can gracefully address 
this situation in practice, we evaluate all three measures 
when making a poor choice for the fixed station: Station 
2 in Table 3, which gives a low MOS value of 2.7. After 
making the fixed station associate in 802.11b (improving
average MOS to 3.2), we find that two stations
consistently experience a high loss rate and low MOS. Once these
two stations opt out of downlink aggregation, we arrive 
at a MOS of 3.5 with low variance (fixed=11b,optout). 


Of course, several of these measures have the potential
of sacrificing much of the bulk traffic throughput gains
that were obtained from downlink aggregation in the
first place. We evaluate both TCP throughput and VoIP
quality for the above Station 2 scenario while applying
all three measures. Downlink TCP throughput
(Figure 12(a)) does not suffer much from these
countermeasures: Softspeak continues to more than triple
TCP downlink throughput. However, the resulting
uplink TCP throughput (781 KB/s, Figure 12(b)) is 12%
less than the throughput achievable by Softspeak without
enabling these countermeasures (not shown). Nevertheless,
even with the countermeasures enabled Softspeak
is able to achieve a significant improvement in residual
throughput (34%) on TCP uplink traffic. For both TCP
downlink and uplink Softspeak mostly maintains or
significantly improves VoIP quality. For completeness,
Figure 13 presents the corresponding results when all clients
use a 20-ms G.729 codec. As expected, Softspeak delivers
less benefit in terms of throughput increase, yet
remains critical for uplink VoIP call quality.

Figure 14: VoIP in combination with bulk TCP traffic
(802.11g protected mode, no opt-out). Only five VoIP
stations are active. In (c) and (d) the remaining five
stations engage in Web traffic. The throughput measured is
that of bulk TCP.


4.5 Softspeak and Web traffic 


So far we have focused on Softspeak’s impact on bulk 
traffic, without other traffic present. In reality, of course, 
one may expect a diverse traffic mix. We next evaluate 
how our results change in the presence of Web traffic, by 
running an equal number of VoIP clients and Web clients 
in combination with a bulk TCP stream, where each of 
the Web clients repeatedly downloads the front page of 
cnn.com (630 KB). Note that the size of our testbed 
limits us to five VoIP clients and five Web clients, and 
the magnitude of improvement is expected to be smaller 
than for a larger number of clients. In Figure 14, we plot 
Softspeak’s improvements before (a and b) and after (c 
and d) adding Web traffic. Comparing the two scenarios 
we find that, independent of the presence of Web traffic,
Softspeak (a) raises uplink MOS to an identical level,
(b) roughly maintains downlink MOS, and (c) improves 
downlink TCP throughput to the same degree (roughly 
35%). However, we also find that the gains made by Soft- 
speak on TCP uplink throughput diminish in the presence 
of Web traffic. In summary, it appears that, with the ex- 
ception of TCP uplink throughput, Softspeak’s improve- 
ments on the efficiency of the network are maintained, 
even when Web traffic is present. 


5 Limitations and discussion 


The scalability of Softspeak is limited by the number 
of slots available for uplink TDMA, i.e., ten clients in
802.11b (given 10-ms inter-packet interval VoIP codecs).
In 802.11g (non-protected mode) the number of clients
can be raised to twenty by choosing 500-µs TDMA slots
(assuming a 48-Mbps sending rate). In addition, the 
number of available slots can be further doubled in the 
case that only 20-ms codecs are in use. 
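
The slot budget behind these figures is a simple division of the codec interval by the TDMA slot length. The sketch below restates that arithmetic; the function name is hypothetical.

```python
def max_voip_clients(codec_interval_us, tdma_slot_us):
    """Number of whole TDMA slots that fit in one codec interval."""
    return codec_interval_us // tdma_slot_us


if __name__ == "__main__":
    print("802.11b, 10-ms codec, 1-ms slots:", max_voip_clients(10_000, 1000))
    print("802.11g, 10-ms codec, 500-us slots:", max_voip_clients(10_000, 500))
    print("802.11g, 20-ms codec, 500-us slots:", max_voip_clients(20_000, 500))
```

Halving the slot length or doubling the codec interval each doubles the number of stations that can be given a slot of their own.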


Softspeak relies on clients overhearing each other’s 
VoIP communication to perform downlink aggregation. 
Therefore, if a WLAN uses a WiFi encryption protocol 
such as WPA2, downlink aggregation is no longer possi- 
ble. Uplink TDMA, on the other hand, is not affected by 
encryption. Protocols encrypted above the MAC layer, 
such as Skype, can continue to take advantage of Softs- 
peak’s downlink aggregation, as long as they allow some 
way of being detected as VoIP. 


Another consequence of downlink aggregation is that 
Softspeak places a station’s interface in promiscuous 
mode, raising concerns of increased power usage. Sta- 
tions engaging in VoIP traffic cannot currently benefit 
from 802.11 power saving mode (PSM) with or without 
Softspeak enabled, since PSM’s duty cycling granular- 
ity is too coarse (a multiple of the beacon interval time). 
However, Softspeak introduces a well-defined schedule, 
both for uplink (TDMA) and downlink traffic (the ag- 
gregator’s schedule), even in the face of jitter caused 
by VoIP applications or the wide-area network. Future 
rapid-duty cycling hardware may be able to exploit Soft- 
speak to provide more fine-grained power savings. 


VoIP silence suppression may go some way towards 
mitigating the impact of VoIP, decreasing the need for 
Softspeak. However, it appears that silence suppres- 
sion is not universally implemented or supported by all 
codecs. For example, while monitoring a G.711 call be- 
tween a Linksys VoIP phone and a softphone (Twinkle), 
we observe no change to inter-packet time in traffic sent 
by either side, even when the sender is muted. The same 
applies when we monitor a SkypeOut call. On the other 
hand, we have observed that Skype-to-Skype calls do 
employ silence suppression by lowering the sending rate, 
rather than eliminating traffic completely. 


6 Related work 


Researchers have studied VoIP call quality in wireless 
networks and attempted to quantify how many VoIP calls 
traditional WiFi networks can handle while maintaining 
various quality-of-service (QoS) metrics. These range 
from analytical and simulation-based studies [3, 14, 22, 
25] to those that validate findings by measurements on 
actual experimental testbeds [4, 9, 20]. While precise 
findings vary, all studies agree that the effective VoIP ca- 
pacity of a WiFi network is less than one might expect 
given the bandwidth usage of typical VoIP streams. 

The poor performance of VoIP in WiFi networks is
not protocol specific, but is symptomatic of a general
issue with any CSMA (carrier-sense, multiple-access)
network: channel access and arbitration become
increasingly inefficient as load (in terms of number of attempted
channel accesses) increases. TDMA can be far more ef- 
ficient under heavy load. Indeed, 802.11 includes both a 
point coordination function (PCF) mode and a hybrid co- 
ordination function (HCF) mode, in which the AP explic- 
itly arbitrates channel access. Unfortunately, very few 
deployed 802.11 networks employ these modes. 

If one considers modifying the hardware, a variety 
of options exist. For example, researchers have
proposed modifying 802.11 PCF [3, 11] as well as
alternative ways of implementing 802.11e-like
functionality [22]. Of course, non-backwards compatible
modifications do not address the issue facing today’s networks.
Accordingly, researchers have proposed a variety of ex- 
plicit time-slotting mechanisms, both within the context 
of infrastructure-based networks [10, 15, 16, 18, 21] and 
multi-hop mesh networks [13, 17]. 

MadMAC [18], ARGOS [13], and the Overlay MAC 
Layer (OML) [17] each propose to enable time-slotting 
on the order of 20 ms. Snow et al. [21] present a simi- 
lar TDMA-based approach to power savings where each 
slot is of the order of 100 ms and requires changes at 
the access points themselves. These scheduling granu- 
larities are too coarse to effectively support most VoIP 
codecs. While software TDMA (STDMA) [10] proposes
to do TDMA for all traffic, its authors focus particularly on
the performance of VoIP. Their approach is a substantial
and backward-incompatible modification to 802.11
that requires accurate clock synchronization. More
significantly, each of the above schemes requires the entire
network to support the new TDMA architecture with no 
support for unmodified clients. 

Over and above TDMA mechanisms, the Soft- 
MAC [15] and MultiMAC [8] projects also suggest mod- 
ifications to 802.11 MAC behavior, including changing 
the ACK timing and modifying back-off parameters. The 
authors do not provide many details about their imple- 
mentations, however, nor do they evaluate their scheme 
with deadline-driven VoIP traffic. 

Focusing explicitly on improving the performance of
VoIP traffic in mixed-use networks, various proposals
have suggested prioritizing VoIP traffic [4, 25],
notably a commercial product, Spectralink Voice Priority
(SVP) [2]. SVP prioritizes downlink VoIP packets in 
the AP transmit queue and does not back-off when at- 
tempting VoIP transmissions. While we leverage similar 
optimizations, SVP does not do scheduling, thereby in- 
creasing collision rate due to the lack of back-off. 

Finally, several studies [12, 19] have shown using 
simulations that prioritizing traffic, using modified con- 
tention parameters, can lead to fairness and better re- 
source allocation in both uplink and downlink directions. 
In contrast to our work, these proposals aim only to bal- 
ance uplink and downlink traffic flows and do not evalu- 
ate TCP traffic in combination with VoIP traffic. 


7 Conclusion 


As WiFi-capable smartphone handsets become more 
popular, the number of simultaneous VoIP users is likely 
to increase dramatically in WiFi hotspots and enterprise 
networks. While previous work has aggregated downlink 
VoIP traffic, it has focused on improving VoIP call qual- 
ity in the face of competing best-effort traffic, but has ig- 
nored the impact of a large number of simultaneous VoIP 
sessions on the residual capacity of the network. 

We present Softspeak, a set of backward-compatible 
changes to WiFi that address contention and framing 
overhead. We show that our dynamic IFS contention 
scheme, combined with downlink aggregation, dramati- 
cally reduces the impact of VoIP on network capacity yet 
improves call quality. Our project page (including
audio samples) is at
http://sysnet.ucsd.edu/wireless/softspeak/.

Notes


¹In G.729 each direction has a 10-ms inter-packet arrival, an eight-byte
voice payload, and twelve additional bytes of RTP header. Variants
of G.729 also run at longer inter-packet times and/or increased
voice payload sizes.

²We assume short preambles throughout the paper.

³Note that delay in one direction affects MOS in both directions.

Abstract


TCP has well-known problems over multi-hop wireless 
networks as it conflates congestion and loss, performs 
poorly over time-varying and lossy links, and is fragile 
in the presence of route changes and disconnections. 
Our contribution is a clean-slate design and implemen- 
tation of a wireless transport protocol, Hop, that uses re- 
liable per-hop block transfer as a building block. Hop is 
1) fast, because it eliminates many sources of overhead 
as well as noisy end-to-end rate control, 2) robust to par- 
titions and route changes because of hop-by-hop control 
as well as in-network caching, and 3) simple, because it 
obviates complex end-to-end rate control as well as com- 
plex interactions between the transport and link layers. 
Our experiments over a 20-node multi-hop mesh network 
show that Hop is dramatically more efficient, achieving 
better fairness, throughput, delay, and robustness to par- 
titions over several alternate protocols, including gains of 
more than an order of magnitude in median throughput. 


1 Introduction 


Wireless networks are ubiquitous, but traditional trans- 
port protocols perform poorly in wireless environments, 
especially in multi-hop scenarios. Many studies have 
shown that TCP, the universal transport protocol for re- 
liable transport, is ill-suited for multi-hop 802.11 net- 
works. There are three key reasons for this mismatch. 
First, multi-hop wireless networks exhibit a range of 
loss characteristics depending on node separation, chan- 
nel characteristics, external interference, and traffic load, 
whereas TCP performs well only under low loss condi- 
tions. Second, many emerging multi-hop wireless net- 
works such as long-distance wireless mesh networks, and 
delay-tolerant networks exhibit intermittent disconnec- 
tions or persistent partitions. TCP assumes a contem- 
poraneous end-to-end route to be available and breaks 
down in partitioned environments [13]. Third, TCP has 
well-known fairness issues due to interactions between 
its rate control mechanism and CSMA in 802.11, e.g.,
it is common for some flows to get completely shut out
when many TCP/802.11 flows contend simultaneously 
[37]. Although many solutions (e.g. [16, 32, 38]) have 
been proposed to address parts of these problems, these 
have not gained much traction and TCP remains the dom- 
inant available alternative today. 

Our position is that a clean slate re-design of wireless 
transport necessitates re-thinking three fundamental de- 
sign assumptions in legacy transport protocols, namely 
that 1) a packet is the unit of reliable wireless transport, 
2) end-to-end rate control is the mechanism for dealing
with congestion, and 3) a contemporaneous end-to-end 
route is available. The use of a small packet as the gran- 
ularity of data transfer results in increased overhead for 
acknowledgements, timeouts and retransmissions, espe- 
cially in high contention and loss conditions. End-to-end 
rate control severely hurts utilization as end-to-end loss 
and delay feedback is highly unpredictable in multi-hop 
wireless networks. The assumption of end-to-end route 
availability stalls TCP during periods of high contention 
and loss, as well as during intermittent disconnections. 

Our transport protocol, Hop, uses reliable per-hop 
block transfer as a building block, in direct contrast to 
the above assumptions. Hop makes three fundamen- 
tal changes to wireless transport. First, Hop replaces 
packets with blocks, i.e., large segments of contiguous
data. Blocks amortize many sources of overhead includ- 
ing retransmissions, timeouts, and control packets over 
a larger unit of transfer, thereby increasing overall uti- 
lization. Second, Hop does not slow down in response 
to erroneous end-to-end feedback. Instead, it uses hop- 
by-hop backpressure, which provides more explicit and 
simple feedback that is robust to fluctuating loss and de- 
lay. Third, Hop uses hop-by-hop reliability in addition to 
end-to-end reliability. Thus, Hop is tolerant to intermit- 
tent disconnections and can make progress even when 
a contemporaneous end-to-end route is never available, 
i.e., the network is always partitioned [3].

Large blocks introduce two challenges that Hop converts
into opportunities. First, end-to-end block retransmissions
are considerably more expensive than packet
retransmissions. Hop ensures end-to-end reliability 
through a novel retransmission scheme called virtual re- 
transmissions. Hop routers cache large in-transit blocks. 
Upon an end-to-end timeout triggered by an outstand- 
ing block, a Hop sender sends a token corresponding to 
the block along portions of the route where the block is 
already cached, and only physically retransmits blocks 
along non-overlapping portions of the route where it is 
not cached. Second, using large blocks as the unit of transmission
exacerbates hidden terminal situations. Hop uses a
novel ack withholding mechanism that sequences block 
transfer across multiple senders transmitting to a single 
receiver. This lightweight scheme reduces collisions in 
hidden terminal scenarios while incurring no additional 
control overhead. 
In summary, our main contribution is to show that 
reliable per-hop block transfer is fundamentally better 
than the traditional end-to-end packet stream abstraction 
through the design, implementation, and evaluation of 
Hop. The individual components of Hop’s design are 
simple and perhaps right out of an undergraduate net- 
working textbook, but they provide dramatic improve- 
ments in combination. In comparison to the best variant 
of 1) TCP, 2) Hop-by-Hop TCP, and 3) DTN2.5, a
delay-tolerant transport protocol [8],
> Hop achieves a median goodput benefit of 1.6x and
2.3x over single- and multi-hop paths respectively.
The corresponding lower quartile gains are 28x and
2.7x, showing that Hop degrades gracefully.

> Under high load, Hop achieves over an order of 
magnitude benefit in median goodput (e.g., 90x 
over TCP with 30 concurrent large flows), while 
achieving comparable or better aggregate goodput 
and transfer delay for large as well as small files. 

> Hop is robust to partitions, and maintains its performance
gains in well-connected WLANs and mesh
networks as well as disruption-prone networks. Hop
also co-exists well with delay-sensitive VoIP traffic.


2 Why reliable per-hop block transfer? 


In this section, we give some elementary arguments 
for why reliable per-hop block transfer with hop-by- 
hop flow control is better than TCP’s end-to-end packet 
stream with end-to-end rate control in wireless networks. 


Block vs. packet: A major source of inefficiency 
is transport layer per-packet overhead for timeouts, ac- 
knowledgements and retransmissions. These overheads 
are low in networks with low contention and loss but in- 
crease significantly as wireless contention and loss rates 
increase. Transferring data in blocks as opposed to packets
provides two key benefits. First, it amortizes the overhead
of each control packet over a larger number of data
packets. This allows us to use additional control packets,
for example, to exploit in-network caching, which would 
be prohibitively expensive at the granularity of a packet. 
Second, it enables transport to leverage link-layer tech- 
niques such as 802.11 burst transfer capability [1], whose 
benefits increase with large blocks. 


Transport vs. link-layer reliability: Wireless channels
can be lossy with extremely high raw channel loss
rates in high interference conditions. In such networks, 
the end-to-end delivery rate decreases exponentially with 
the number of hops along the path, severely degrading 
TCP throughput. The state-of-the-art response today is 
to use a sufficiently large number of 802.11 link-layer 
acknowledgements (ARQ) to provide a reliable channel 
abstraction to TCP. However, 802.11 ARQ 1) interacts 
poorly with TCP end-to-end rate control as it increases 
RTT variance, 2) increases per-packet overhead due to 
more carrier sensing, backoffs, and acknowledgments, 
especially under high contention and loss (in §5.1.1,
we show that 802.11b ARQ has 35% overhead). Note 
that TCP’s woes cannot be addressed by just setting the 
802.11 ARQ limit to a large value as it would reduce the 
overall throughput by disproportionately using the chan- 
nel for transmitting packets over bad links. Unlike TCP, 
Hop relies solely on transport-layer reliability and avoids 
link-layer retransmissions for data, thereby avoiding neg- 
ative interactions between the link and transport layers. 


Hop-by-hop vs. end-to-end congestion control: Rate 
control in TCP occurs in response to end-to-end loss and 
delay feedback reported by each packet. However, end- 
to-end feedback is error-prone and has high variance in 
multi-hop wireless networks as each packet observes sig- 
nificantly different wireless interference across different 
contention domains along the route. This variance hurts 
TCP’s utilization as: 1) its window size shrinks conserva- 
tively in response to loss, and 2) it experiences frequent 
retransmission timeouts when no data is sent. 

Our position is that fixing TCP’s rate control algorithm 
in environments with high variability is fundamentally 
difficult. Instead, we circumvent end-to-end rate control, 
and replace it with hop-by-hop backpressure. Our ap- 
proach has two key benefits: 1) hop-by-hop feedback is 
more robust than end-to-end feedback as it involves only 
a single contention domain, and 2) block-level feedback 
provides an aggregated link quality estimate that has less 
variability than packet-level feedback. 


In-network caching: The use of reliable per-hop
block transfer enables us to exploit caching at intermediate
hops for two benefits. First, caching obviates redundant
retransmissions along previously traversed segments
of a route. Second, caching is more robust to intermittent
disconnections as it enables progress even when
a contemporaneous end-to-end route is unavailable. Hop
can also use secondary storage if needed in partitionable
networks with long disconnections and reconnections.

Figure 1: Structure of a block.


3 Design 


This section describes the Hop protocol in detail. Hop’s 
design consists of six main components: 1) reliable per- 
hop block transfer, 2) virtual retransmissions for end- 
to-end reliability, 3) backpressure congestion control, 4) 
handling routing partitions, 5) ack withholding to handle 
hidden terminals, and 6) a per-node packet scheduler. 


3.1 Reliable per-hop block transfer 


The unit of reliable transmission in Hop is a block, i.e.,
a large segment of contiguous data. A block comprises
a number of txops (the unit of a link layer burst), each of
which in turn consists of a number of frames (Figure 1). The
protocol proceeds in rounds until a block B is successfully
transmitted. In round i, the transport layer sends
a BSYN packet to the next-hop requesting an acknowl- 
edgment for B. Upon receipt of the BSYN, the receiver 
transmits a bitmap acknowledgement, BACK, with bits 
set for packets in B that have been correctly received. In 
response to the BACK, the sender transmits packets from 
B that are missing at the receiver. This procedure repeats 
until the block is correctly received at the receiver. 
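
The round structure described above can be illustrated with a small simulation. This is only a sketch under simplifying assumptions that are ours, not the paper's: packets are lost independently with a fixed probability, the BSYN and BACK control packets are never lost (they use link-layer ARQ), and the function name `transfer_block` and its return values are hypothetical.

```python
import random

def transfer_block(num_packets, loss_rate, rng):
    """Simulate BSYN/BACK rounds for one block over one lossy hop.

    Returns (rounds, data_packets_sent). Control packets are assumed
    to be delivered reliably, which is a simplification.
    """
    if not 0.0 <= loss_rate < 1.0:
        raise ValueError("loss_rate must be in [0, 1)")
    received = [False] * num_packets  # receiver state, reported as the BACK bitmap
    rounds = 0
    data_sent = 0
    while True:
        rounds += 1
        # BSYN from the sender, BACK bitmap from the receiver.
        missing = [i for i, ok in enumerate(received) if not ok]
        if not missing:
            return rounds, data_sent
        # The sender transmits only the packets the BACK reports as missing.
        for i in missing:
            data_sent += 1
            if rng.random() >= loss_rate:
                received[i] = True
```

On a loss-free hop this sketch needs two BSYN/BACK exchanges for a block of any size: one that triggers the initial transmission and one that confirms completion.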


Control Overhead: Hop requires minimal control 
overhead to transmit a block. At the link layer, Hop dis- 
ables acknowledgements for all data frames, and only en- 
ables them to send control packets: BSYN and BACK. 
At the transport layer, a BACK acknowledges data in 
large chunks rather than in single packets. The reduced 
number of acknowledgement packets is shown in Fig- 
ure 2, which contrasts the timeline for a TCP packet 
transmission alongside a block transfer in Hop. For large 
blocks (e.g. 1 MB), Hop requires orders of magnitude 
fewer acknowledgements than for an equivalent number 
of packets using TCP with link-layer acknowledgements. 
In addition, Hop reduces idle time by ensuring that pack- 
ets do not wait for link-layer ACKs, and at the transport 
layer by disabling rate control. Thus, Hop nearly always 
sends data at a rate close to the link capacity. 


Spatial Pipelining: The use of large blocks and hop-by-hop
reliability can hurt spatial pipelining since each
node waits for the successful reception of a block before
forwarding it. To improve pipelining, an intermediate
hop forwards packets as soon as it receives at least a
txop worth of new packets instead of waiting for an entire
block. Thus, Hop leverages spatial pipelining as well
as the benefits of burst transfer at the link layer.

Figure 2: Timeline of TCP/802.11 vs. Hop


3.2 Ensuring end-to-end reliability 


Hop-by-hop reliability is insufficient to ensure reliable 
end-to-end transmission. A block may be dropped if 1) 
an intermediate node fails in the middle of transmitting 
the block to the next-hop, or 2) the block exceeds its TTL 
limit, or 3) a cached block eventually expires because no 
next-hop node is available for a long duration. 

Hop uses virtual retransmissions together with in- 
network caching to limit the overhead of retransmitting 
large blocks. Hop routers store all packets that they over- 
hear. Thus, a re-transmitted block is likely cached at 
nodes along the original route until the point of failure or 
drop, and might be partially cached at a node that is along 
a new path to the destination but overheard packets trans- 
mitted on the old path. Hence, instead of retransmitting 
the entire block, the sender sends a virtual retransmission,
i.e., a special BSYN packet, using the same hop-by-hop
reliable transfer mechanism as for a block. Virtual
retransmissions exploit caching at intermediate nodes by 
only transmitting the block (or parts of the block) when 
the next hop along the route does not already have the 
block cached as shown in Figure 3. 
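
A minimal sketch of this idea follows. It assumes whole-block caching only (partially cached blocks are ignored) and a fixed route; the function name `virtual_retransmit`, the `caches` mapping, and the action labels are hypothetical and chosen for illustration.

```python
def virtual_retransmit(route, caches, block_id):
    """Walk a route hop by hop after an end-to-end timeout.

    route  : list of node names from source to destination
    caches : dict mapping node name -> set of cached block ids
    Returns a list of (sender, receiver, action), where action is
    "token" if only the special BSYN crosses the hop, or "data" if
    the block itself has to be physically retransmitted.
    """
    actions = []
    for sender, receiver in zip(route, route[1:]):
        if block_id in caches[receiver]:
            # Next hop already holds the block: no data crosses this hop.
            actions.append((sender, receiver, "token"))
        else:
            # Next hop is missing the block: send it and cache it there.
            actions.append((sender, receiver, "data"))
            caches[receiver].add(block_id)
    return actions
```

For a route S, A, B, D where the block was lost after reaching A, only the hops A to B and B to D carry data; the hop S to A carries just the token.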

A premature timeout in TCP incurs a high cost both 
due to redundant transmission as well as its detrimental 
rate control consequence, so a careful estimation of time- 
out is necessary. In contrast, virtual retransmissions due 
to premature timeouts do little harm, so Hop simply uses 
the most recent round-trip time as its timeout estimate. 

Figure 4: Example showing need for backpressure. Without
backpressure, Node A would allocate 1/5th of outgoing capacity
to each flow, resulting in queues increasing unbounded
at nodes B through E. With backpressure, most data is sent to
node F, thereby increasing utilization.


3.3 Backpressure congestion control


Rate control in response to congestion is critical in TCP
to prevent congestion collapse and improve utilization. 
In wireless networks, congestion collapse can occur both 
due to increased packet loss due to contention [11], and 
increased loss due to buffer drops [9]. Both cases result 
in wasted work, where a packet traverses several hops 
only to be dropped before reaching the destination. Prior 
work has observed that end-to-end loss and delay feed- 
back has high variance and is difficult to interpret unam- 
biguously in wireless networks, which complicates the 
design of congestion control [2, 32]. 

Hop relies only on hop-by-hop backpressure to avoid 
congestion. For each flow, a Hop node monitors the dif- 
ference between the number of blocks received and the 
number reliably transmitted to its next-hop as shown in 
Figure 4. Hop limits this difference to a small fixed 
value, H, and implements it with no additional overhead
to the BSYN/BACK exchange. After receiving
H complete blocks, a Hop node does not respond to further
BSYN requests from an upstream node until it has
moved at least one more block to its downstream node. 
The default value of H is set to 1 block. 
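
The per-flow bookkeeping can be sketched as below. The class and method names are hypothetical; the sketch only captures the rule that a node stops answering BSYNs once the difference between blocks received and blocks forwarded reaches H.

```python
class FlowState:
    """Per-flow backpressure state at one Hop node (illustrative only)."""

    def __init__(self, limit_h=1):
        self.limit_h = limit_h  # backpressure limit H, in blocks
        self.received = 0       # complete blocks received from upstream
        self.forwarded = 0      # blocks reliably delivered to the next hop

    def backlog(self):
        return self.received - self.forwarded

    def should_answer_bsyn(self):
        # Withhold the BACK while the backlog has reached the limit H.
        return self.backlog() < self.limit_h

    def on_block_received(self):
        self.received += 1

    def on_block_forwarded(self):
        self.forwarded += 1
```

Because the decision reuses the existing BSYN/BACK exchange, the upstream node is throttled simply by not receiving a BACK; no extra control message is modeled here.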

Backpressure in Hop significantly improves utilization.
To appreciate why, consider the following scenario
where flows $1, \ldots, k$ all share the first link with a low
loss rate. Assume that the rest of flow 1’s route has
a similar low loss rate, while flows $2, \ldots, k$ traverse
a poor route or are partitioned from their destinations.
Let $C$ be the link capacity, $p_1$ be the end-to-end
loss rate observed by the first flow, and $p_2$ be the end-to-end
loss rate observed by the other flows ($p_1 < p_2$). Without
backpressure, Hop would allocate a $1/k$ fraction of
link capacity to each flow, yielding a total goodput of
$C \cdot \left(\frac{1}{k}(1 - p_1) + \frac{k-1}{k}(1 - p_2)\right)$, and the number of buffered
blocks at the next-hops of the latter $k - 1$ flows grows
unbounded. On the other hand, limiting the number of
buffered blocks for each flow yields a goodput close to
$C \cdot (1 - p_1)$ in this example.
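
To make the gap concrete, the two expressions can be evaluated numerically. The sketch below assumes the goodput without backpressure is C((1 - p1)/k + (k - 1)(1 - p2)/k), which is our reading of the scenario (one good flow and k - 1 poor flows, each given a 1/k share); the function names and the sample numbers are illustrative only.

```python
def goodput_without_backpressure(capacity, k, p1, p2):
    """Equal 1/k split: one flow sees loss p1, the other k-1 see loss p2."""
    return capacity * ((1 - p1) / k + (k - 1) * (1 - p2) / k)

def goodput_with_backpressure(capacity, p1):
    """Nearly all capacity goes to the flow that can actually make progress."""
    return capacity * (1 - p1)

# Example: an 11 Mbps link, 5 flows, 10% loss on the good path,
# and the other four flows fully partitioned (p2 = 1).
without_bp = goodput_without_backpressure(11.0, 5, 0.1, 1.0)
with_bp = goodput_with_backpressure(11.0, 0.1)
```

With these sample values the equal split delivers about 1.98 Mbps, whereas limiting buffered blocks approaches 9.9 Mbps, a factor of k in this extreme case.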

Why does Hop limit the number of buffered blocks, H, 
to a small default value? Note that the example above can 
be addressed simply by choosing the block correspond- 
ing to the flow with the largest differential backlog (along 
A-F). Indeed, classical backpressure algorithms known 
to achieve optimal throughput [33] work similarly. Hop 
limits the number of buffered blocks to a small value in 
order to ensure small transfer delay for finite-sized files, 
as well as to limit intra-path contention. 


3.4 Robustness to partitions 


A fundamental benefit of Hop is that it continues to make 
progress even when the network is intermittently partitioned.
Hop transfers blocks in a hop-by-hop manner
without waiting for end-to-end feedback. Thus, even if 
an end-to-end route is currently unavailable, Hop contin- 
ues to make progress along other hops. 

The ability to make progress during partitions relies 
on knowing which next-hop to use. Unlike typical mesh 
routing protocols [23, 4], routing protocols designed for 
disruption-tolerance expose next-hop information even 
if an end-to-end route is unavailable (e.g. RAPID [3], 
DTLSR [7]). In conjunction with such a disruption- 
tolerant routing protocol, Hop can accomplish data trans- 
fer even if a contemporaneous end-to-end route is never 
available, i.e., the network is always partitioned. 

In disruption-prone networks, a Hop node may need 
to cache blocks for a longer duration in order to make 
progress upon reconnection. In this case, the backpres- 
sure limit needs to be set taking into account the fraction 
of time a node is partitioned and the expected length of 
a connection opportunity with a next-hop node along a 
route to the destination (see §5.7 for an example).


3.5 Handling hidden terminals 


The elimination of control overhead for block transfer 
improves efficiency but has an undesirable side-effect — 
it exacerbates loss in hidden terminal situations. Hop 
transmits blocks without rate control or link-layer re- 
transmissions, which can result in a continuous stream 
of collisions at a receiver if the senders are hidden from 
each other. While hidden terminals are a problem even 
for TCP, rate control mitigates its impact on overall 
throughput. Flows that collide at a receiver observe in- 
creased loss and throttle their rate. Since different flows 
get different perceptions of loss, some reduce their rate
more aggressively than others, resulting in most flows 
being completely shut out and bandwidth being devoted 
to one or few flows [37]. Thus, TCP is highly unfair but 
has good aggregate throughput. 

Hop uses a novel ack withholding technique to mit- 
igate the impact of hidden terminals. Here, a receiver 
acknowledges only one BSYN packet at any time, and 
withholds acknowledgement to other concurrent BSYN 
packets until the outstanding block has completed. In this 
manner, the receiver ensures that it is only receiving one 
block from any sender at a given time, and other senders 
wait their turn. Once the block has completed, the re- 
ceiver transmits the BACK to one of the other transmit- 
ters, which starts transmitting its block. 
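
The receiver-side logic can be sketched as follows. The first-come-first-served order among waiting senders is an assumption made for this example (the text above says only that the BACK goes to "one of the other transmitters"), and the class and method names are hypothetical.

```python
from collections import deque

class AckWithholdingReceiver:
    """Receiver that serves one sender's block at a time (illustrative)."""

    def __init__(self):
        self.active = None      # sender whose block is currently in progress
        self.waiting = deque()  # senders whose BSYNs are being withheld

    def on_bsyn(self, sender):
        """Return the sender that gets a BACK now, or None if withheld."""
        if self.active is None or sender == self.active:
            self.active = sender
            return sender
        if sender not in self.waiting:
            self.waiting.append(sender)
        return None

    def on_block_complete(self):
        """Current block finished: release the BACK to the next waiting sender."""
        self.active = self.waiting.popleft() if self.waiting else None
        return self.active
```

No new packet type is introduced: withholding is expressed purely as the absence of a BACK, which matches the claim that the scheme adds no control overhead.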

Although ack withholding does not address hidden 
terminals caused by flows to different receivers, it of- 
fers a lightweight alternative to expensive and conser- 
vative techniques like RTS/CTS for the common single- 
terminal hidden terminal case. The high overhead of 
RTS/CTS arises from the additional control packets, es- 
pecially since these are broadcast packets that are trans- 
mitted at the lowest bit-rate. The use of broadcast also 
makes RTS/CTS more conservative since a larger con- 
tention region is cleared than typically required [39]. In 
contrast, ack withholding requires no additional control 
packets (as BSYNs and BACKs are already in place for 
block transfer). 


3.6 Packet scheduling 


Hop’s unit of link layer transmission is a txop, which is 
the maximum duration for which the network interface 
card (NIC) is permitted to send packets in a burst without 
contending for access [1]. Hop’s scheduler leverages the 
burst mode and sends a txop’s worth of data from each 
concurrent flow at a time in a round-robin manner. 

Hop traffic is isolated from delay-sensitive traffic 
such as VoIP or video by using link-layer prioritiza- 
tion. 802.11 chipsets support four priority queues— 
voice, video, best-effort, and background in decreasing 
order of priority—with the higher priority queues also 
having smaller contention windows [1]. Hop traffic is 
sent using the lowest priority background queue to mini- 
mize impact on delay-sensitive datagrams. 

The design choices that we have presented so far can 
be detrimental to delay for small files (referred to as 
micro-blocks) in three ways: 1) the initial BSYN/BACK
exchange increases delay for micro-blocks, 2) a sender 
may be servicing multiple flows, in which case a micro- 
block may need to wait for multiple txops, and 3) ack- 
withholding can result in micro-blocks being delayed by 
one or more large blocks that are acknowledged before 
its turn. Hop employs three techniques to optimize delay 
for micro-blocks. First, micro-blocks of size less than a
fixed BSYN batch threshold (few tens of KB) are sent
piggybacked with the BSYN with link-layer ARQ via 
the voice queue. This optimization eliminates the ini- 
tial BSYN/BACK delay, and avoids having to wait for 
a BACK before proceeding, thereby circumventing ack- 
withholding delay. Second, the packet scheduler at the 
sender prioritizes micro-blocks over larger blocks. Finally,
Hop uses a block-size based ack-withholding policy
that prioritizes micro-blocks over larger blocks. 
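
The sender-side ordering can be sketched as below. This is a simplification: it treats each micro-block as a single transmission, ignores the voice-queue piggybacking, and uses an 8 KB txop and a 16 KB micro-block threshold as defaults, which are the values reported in the implementation section. The function name `schedule` and the tie-breaking by flow id are hypothetical.

```python
def schedule(pending, txop_bytes=8 * 1024, micro_threshold=16 * 1024):
    """Order transmissions for a set of concurrent flows.

    pending : dict mapping flow id -> bytes left in that flow's block
    Returns a list of (flow_id, bytes) bursts in transmission order.
    """
    remaining = dict(pending)
    order = []
    # Micro-blocks are prioritized over larger blocks.
    for flow in sorted(f for f, n in remaining.items() if n < micro_threshold):
        order.append((flow, remaining.pop(flow)))
    # Larger blocks share the link round-robin, one txop's worth at a time.
    while remaining:
        for flow in sorted(remaining):
            burst = min(txop_bytes, remaining[flow])
            order.append((flow, burst))
            remaining[flow] -= burst
            if remaining[flow] == 0:
                del remaining[flow]
    return order
```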


4 Implementation 


We have implemented a prototype of Hop with all the 
features described in §3. Hop is implemented in Linux
2.6 as an event-based user-space daemon in roughly 5100 
lines of C code. Hop is currently implemented on top 
of UDP (i.e., there is a UDP header in between the IP
and Hop headers in each frame in Figure 1). Below, we 
describe important aspects of Hop’s implementation. 


4.1 MAC parameters 


Our implementation uses the Atheros-based wireless
chipset and the MadWifi open source 802.11 device
driver [18], a popular commodity implementation. By
default, the MadWifi driver (as well as other commodity
implementations) supports the 802.11e QoS extension.
However, MadWifi supports the extension only in the
access point mode, so we modify the driver to enable it
in the ad-hoc mode as well. Hop uses default 802.11
settings, except for the following. The transmission opportunity
(txop) for the background queue is set to the
maximum value permitted by the MadWifi driver (8160
µs or roughly 8KB of data). Link-layer ARQ is disabled
for all data frames sent via Hop but enabled for control
packets (BSYN, BACK, etc.).


4.2 Hop implementation 


Parameters: A large block size increases batching benefits
[15], so we set the default maximum block size to
1MB. Note that this means that a Hop block is allowed to 
be up to 1MB in size, but may be any smaller size. Hop 
never waits idly in anticipation of more application data 
in order to obtain batching benefits. The BSYN batch 
threshold for micro-blocks is set to a default value of 
16KB, and the backpressure limit, H, is set to 1. The 
virtual retransmission timeout is set to an initial value of 
60 seconds and simply reset to the round-trip block delay 
reported by the most recent block. The TTL limit for
virtual retransmissions is set to 50 hops. In the current
implementation, an intermediate Hop node keeps all the 
blocks that it has received in memory. 


Header format: The Hop header consists of the fol- 
lowing fields. All frames contain the msg_type that 
identifies if the frame is a data, BSYN, BACK, virtual 
retransmission BSYN, or an end-to-end BACK frame; 
the flow_id that uniquely identifies an end-to-end Hop
connection; and the block_num that identifies the current
block. Data frames also contain the packet_num that
is the offset of the packet in the current block. The
packet_num is also used to index into the bitmap returned
in a BACK frame.
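
One possible encoding of these fields is sketched below. The text names the fields but not their widths, numeric codes, or bit order, so everything concrete here is an assumption for illustration: one byte for msg_type, four bytes each for the other fields, network byte order, least-significant-bit-first bitmaps, and a packet_num slot carried in every frame for simplicity even though only data frames use it.

```python
import struct
from enum import IntEnum

class MsgType(IntEnum):
    # Numeric codes are hypothetical; only the five frame kinds come from the text.
    DATA = 0
    BSYN = 1
    BACK = 2
    VIRTUAL_RETX_BSYN = 3
    E2E_BACK = 4

# msg_type (1 byte), flow_id, block_num, packet_num (4 bytes each), big-endian.
HEADER = struct.Struct("!BIII")

def pack_header(msg_type, flow_id, block_num, packet_num=0):
    return HEADER.pack(msg_type, flow_id, block_num, packet_num)

def unpack_header(data):
    msg_type, flow_id, block_num, packet_num = HEADER.unpack_from(data)
    return MsgType(msg_type), flow_id, block_num, packet_num

def packet_acked(bitmap, packet_num):
    """Check the BACK bitmap bit indexed by packet_num (LSB-first assumed)."""
    return bool(bitmap[packet_num // 8] & (1 << (packet_num % 8)))
```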


End-to-end connection management: Because Hop 
is designed to work in partitionable networks, it does not 
use a three-way handshake like TCP to initiate a connec- 
tion. A destination node sets up connection state upon 
receiving the first block. The loss of the first block due 
to a node failure or expiry or the loss of the first end- 
to-end BACK is handled naturally by virtual retransmis- 
sions. In our current implementation, a Hop node tears 
down a connection simply by sending a FIN message and 
recovering state; we have not yet implemented optimiza- 
tions to handle complex failure scenarios. 


5 Evaluation 


We evaluate the performance of Hop in a 20-node wire- 
less mesh testbed. Each node is an Apple Mac Mini 
computer running Linux 2.6 with a 1.6 GHz CPU, 2 GB
RAM and a built-in 802.11a/b/g Atheros/MadWifi wireless
card. Each node is also connected via an Ethernet
port to a wired backplane for debugging, testing, and data 
collection. The nodes are spread across a single floor of 
the UMass CS building as shown in Figure 5. 

All experiments, except those in §5.9 and §5.10, were
run in 802.11b mode with bit-rate locked at 11 Mbps. 
There is significant inherent variability in wireless con- 
ditions, so in order to perform a meaningful comparison, 
a single graph is generated by running the corresponding 
experiments back-to-back interspersed with a short ran- 
dom delay. The compared protocols are run in sequence, 
and each sequence is repeated many times to obtain con- 
fidence bounds. 

We compare Hop against two classes of protocols: 
end-to-end and hop-by-hop. The former consists of 1) 
UDP, and 2) the default TCP implementation in Linux 
2.6 with CUBIC congestion control [10]; we did not use 
the Westwood+ congestion control algorithm since it per- 
formed roughly 10% worse. The latter consists of 3) 
Hop-by-Hop TCP, and 4) DTN2.5 [8]. Hop-by-Hop TCP 
is our implementation of TCP with backpressure. It splits 
a multi-hop TCP connection into multiple one-hop TCP 
connections, and leverages TCP flow control to achieve 
hop-by-hop backpressure. Each node maintains one out- 
going TCP socket and one incoming TCP socket for each 
flow. When the outgoing socket is full, Hop-by-Hop TCP 
stops reading from the incoming socket, thereby forcing 
TCP’s flow control to pause the previous hop’s outgoing 
socket. This “backpressure” propagates up to the source 
and forces the source to slow down. DTN2.5 is a reference
implementation of RFC 4838 and RFC 5050
from the Delay Tolerant Networking Research Group [8]
that reliably transfers a bundle using TCP at each hop.
Hop and UDP were set to use the same default packet
size as TCP (1.5KB). In all our experiments, the delay
and goodput of TCP are measured after subtracting connection
setup time.

Figure 5: Experimental testbed with dots representing nodes.


5.1 Single-hop microbenchmarks 


In this section, we answer two questions: 1) What are 
the best 802.11 settings for link layer acknowledgments 
(ARQ) and burst mode (txop) for TCP and UDP? 2)
How does Hop’s performance compare to that of TCP 
and UDP given the benefit of these best-case settings? 

Figure 6: Experiment with one-hop flows. Hop improves lower
quartile goodput by 28x, median goodput by 1.6x, and mean
goodput by 1.6x over TCP with the best link layer settings.

Figure 7: Box shows lower/median/upper quartile, lines show max/min, and dot
shows mean. Increasing 802.11 ARQ limit and using txops
helps TCP but Hop is still considerably better. UDP results
show that ARQs incur significant performance overhead (35%).
Hop is within 24% of UDP without ARQ (achievable goodput).

                                                         Result Median (Mean)
§5.1  One single-hop flow    Hop vs. TCP                 1.6x (1.6x)
§5.2  One multi-hop flow     Hop vs. TCP                 2.3x (2x)
                             Hop vs. Hop-by-Hop TCP      2.5x (2x)
                             Hop vs. DTN2.5              2.9x (3.9x)
§5.3  Many multi-hop flows   Hop vs. TCP                 90x (1.25x)
                             Hop vs. Hop-by-Hop TCP      20x (1.4x)


§5.4 


Performance breakdown Base Hop 
+ ack withholding 
+ backpressure 


+ ack withholding + backpressure 




Hop vs. TCP 


WLAN AP mode 


§5.6
Concurrent small files 


Hop vs. TCP + RTS/CTS 


Single small file Hop vs. TCP 3x to 15x lower delay 
Hop vs. TCP Comparable or lower delay 


2.7x (1.12x) 
2x (1.4x) 


Hop vs. DTN23 2.8% 29%) 


Impact on VoIP traffic Hop vs. TCP 


Hop vs. TCP + OLSR 
Hop vs. TCP + auto-rate 
Hop vs. TCP + OLSR + auto-rate 


§5.9 


§5.10 Under 802.11g


Network and link-layer dynamics 


Hop vs. TCP 





Hop vs. TCP + auto-rate 


28K 29x) 
Slightly lower MOS score but sig- 
4x (1x) 
95x (2.4x) 


5x (1.8x) 


22x (1x) 
6x (3x) 


Table 1: Summary of evaluation results. All protocols above are given the benefit of burst-mode (txop) and the maximum number 
of link-layer retransmissions (max-ARQ) supported by the hardware. 


5.1.1 Randomly picked links


In this experiment, we evaluate the single-hop perfor- 
mance of TCP, UDP, and Hop over 802.11 across links 
in our mesh testbed. The testbed has a total of 56 unique
links from which a random sequence of 100 links was
sampled with repetition for this experiment. The average
and median loss rates were 25% and 1% respectively. For
each sampled link, a 10MB file is transferred using each
protocol; for bad links, flows were cut off at 10 minutes, 
and goodput measured until the last received packet. The 
metric for comparison is the goodput that is measured as 
the total number of unique packets received at the re- 
ceiver divided by the time until the last byte is received. 
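
As a small illustration of this metric, the sketch below counts each packet once even if it arrived several times and divides by the time of the last reception. Expressing the result in Mbps and identifying packets by sequence number are our assumptions; the function name is hypothetical.

```python
def goodput_mbps(received_seq_nums, packet_bytes, seconds_to_last_byte):
    """Goodput from unique packets received, reported in megabits per second."""
    unique_packets = len(set(received_seq_nums))  # duplicates do not count
    bits = unique_packets * packet_bytes * 8
    return bits / seconds_to_last_byte / 1e6
```

For example, five receptions of which four are unique 1500-byte packets, completed in 6 ms, correspond to about 8 Mbps.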

We compare Hop against TCP for three 802.11 set- 
tings: 1) 11 link layer retries (ARQ) with no txop, the 
default settings of the MadWifi driver, 2) 11 ARQ + 
txop, and 3) maximum permitted ARQ setting (18 for 
the Atheros card) + txop. We do not consider TCP with 
no ARQ since it (expectedly) performs poorly without 
802.11 retransmissions on lossy links. We also compare 
against UDP under different 802.11 settings. Since UDP 
has no transport-layer control overhead, and transmits as 
fast as the card can transmit packets, it provides an up- 
per bound on the achievable capacity on the link. For 
clarity of presentation, we show cumulative distributions 
(CDFs) for Hop and the best TCP combination and sum- 
mary statistics for the other combinations (for which full 
distributions are available in [15]). 

Figure 6 shows that Hop significantly outperforms
TCP/max-ARQ/txop, the best TCP combination. The
Q1, Q2, and Q3 gains over the TCP/max-ARQ/txop
combination are 28x, 1.6x, and 1.2x respectively. The
Q1 gain is notable and shows Hop’s robust performance
on poor links compared to TCP.
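
As an illustration of how such quartile gains can be derived from two sets of per-link goodputs, the sketch below takes the ratio of the corresponding quartiles of each distribution. Whether the reported gains are ratios of quartiles or quartiles of per-link ratios is not stated here, so the ratio-of-quartiles reading, the function name, and the sample values are assumptions.

```python
import statistics
from typing import Sequence, Tuple


def quartile_gains(
    hop: Sequence[float], tcp: Sequence[float]
) -> Tuple[float, float, float]:
    """Return (Q1, Q2, Q3) ratios of the Hop quartiles to the TCP quartiles."""
    hop_q = statistics.quantiles(hop, n=4, method="inclusive")
    tcp_q = statistics.quantiles(tcp, n=4, method="inclusive")
    gains = [h / t if t > 0 else float("inf") for h, t in zip(hop_q, tcp_q)]
    return gains[0], gains[1], gains[2]


if __name__ == "__main__":
    # Hypothetical per-link goodputs in Kbps, for illustration only.
    print(quartile_gains([10, 20, 30, 40, 50], [1, 2, 3, 4, 5]))
```

A large Q1 gain combined with a small Q3 gain, as reported above, indicates that most of the benefit comes from the poorest links.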

Figure 7 shows the summary statistics for Hop and the
two best TCP and UDP schemes using a box plot
representation. The “box” shows the upper quartile (Q3),
median (Q2), and lower quartile (Q1), and the “whiskers”
show the maximum and minimum goodput. UDP/no-ARQ/txop
is the best UDP combination and provides an upper bound
on the achievable rate. The median Hop goodput is about
24% lower than this achievable rate. Interestingly,
turning on ARQ degrades UDP by 35%, showing that ARQ in
802.11 comes at a high overhead and that ARQ alone is
not sufficient to fix TCP’s problems.


As we find that TCP performance consistently improves
by using txops and ARQ with the maximum
possible limit, we give TCP and its variants the benefit
of txop/max-ARQ in the rest of our evaluation.





5.1.2 Graceful performance degradation 


A key benefit of Hop is robustness, i.e., its performance
degrades gracefully with increasing link losses and
interference. To confirm this, we further analyze the data
from the experiment in §5.1.1. Figure 8(a) shows the
per-link throughput across the 56 links in the testbed (with
multiple runs over the same link averaged) sorted by TCP
goodput. Hop degrades gracefully even on some of the poorest
links in the testbed, where TCP’s throughput is near zero.
The average goodput for the worst 20 TCP flows is 334
Kbps, whereas Hop’s goodput for the same flows is 2.37
Mbps, a difference of 7x.

Figure 8: Graceful degradation to adverse channel conditions.
First plot shows per-link goodputs from one-hop experiment
sorted in TCP order. Second plot shows controlled experiments
demonstrating impact of loss. In both cases, Hop is more
robust and degrades far more gracefully than TCP.

To understand the cause of TCP’s fragile behavior, 
we evaluate the impact of loss perceived at the trans- 
port layer on the performance of Hop and TCP. We start 
with a perfect link that has a near-zero loss rate and in- 
troduce loss by modifying the MadWifi device driver to 
randomly drop a specified fraction of incoming pack- 
ets. Figure 8(b) shows that, unsurprisingly, TCP goodput 
drops to near-zero when loss rate is roughly 20%. Hop 
shows graceful near-linear degradation and is operational 
until the loss rate is about 80%. 
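
The loss-injection step can be pictured as an independent (Bernoulli) drop applied to each incoming packet. The sketch below is a user-level stand-in for that idea, not the MadWifi driver modification itself; the function name and the use of a seeded `random.Random` are assumptions made so the example is reproducible.

```python
import random
from typing import Iterable, Iterator, Optional, TypeVar

T = TypeVar("T")


def drop_randomly(
    packets: Iterable[T],
    loss_rate: float,
    rng: Optional[random.Random] = None,
) -> Iterator[T]:
    """Yield each packet independently with probability 1 - loss_rate."""
    if not 0.0 <= loss_rate <= 1.0:
        raise ValueError("loss_rate must be between 0 and 1")
    rng = rng or random.Random()
    for pkt in packets:
        # rng.random() is uniform in [0, 1), so a packet survives
        # with probability exactly 1 - loss_rate.
        if rng.random() >= loss_rate:
            yield pkt
```

Sweeping `loss_rate` from 0 to 1 and measuring goodput at each step yields a curve of the kind shown in Figure 8(b).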


5.2 Multi-hop microbenchmarks


How does Hop perform on multi-hop paths compared to
existing alternatives? To study this question, we pick
a sequence of 100 node pairs randomly with repetition
from the testbed. Static routes are set up a priori between
all node pairs to isolate the impact of route flux
(considered in §5.3). The static routes were obtained by
running OLSR with the default ETX metric until the routing
topology stabilized at the beginning of the experiment.
Among the 100 randomly chosen flows, 30% are two-hop,
30% are three-hop, 10% are four-hop, 20% are five-hop,
and the remaining 10% are seven-hop flows. We
compare the multi-hop goodput of Hop to TCP, Hop-by-Hop
TCP, DTN2.5, and UDP.

Figure 9: Experiment with multi-hop flows. Hop improves
lower quartile goodput by 2.7x, median goodput by 2.3x, and
mean goodput by 2x.

Figure 10: Boxplot of multi-hop single-flow benchmarks. Hop
has 2-3x median and 2-4x mean improvements over other
reliable transport protocols. Hop is comparable to
UDP/no-ARQ/txop in terms of median/mean; the latter is extremely
fast since it has no overhead, but experiences more loss.

Figure 9 shows the CDF of goodput for just Hop and
TCP, while Figure 10 shows the summary statistics for
all the protocols. Hop consistently outperforms all other
protocols. The Q1, Q2, and Q3 gains over TCP are
2.7x, 2.3x, and 1.9x respectively. The Q1 gain over
TCP is lower than for the single-hop experiment because
only good links selected by OLSR are used in this
experiment (as evidenced by the better performance of
UDP/no-ARQ/txop compared to UDP/max-ARQ/txop).
Over lossier paths, Hop’s gains are much higher. We
also find that the gains grow with an increasing number
of hops (refer to the technical report [15]). For example,
the lower quartile gains grow from about 2.7x for two
hops to more than 4x for five and six hops.


5.3 Hop under high load


The experiments so far considered one flow in isolation.
Next, we evaluate Hop in a heavily loaded network to
understand the effect of increased contention and collisions
on Hop’s performance and fairness. We compare Hop,
TCP, and Hop-by-Hop TCP. The experiment consists of
thirty concurrent flows that transfer data continually
between randomly chosen node pairs in the testbed. All
protocols are run over a static mesh topology identical to
§5.2. To focus on multihop benefits, we pick src-dst pairs
that are not immediate neighbors of each other. We run
the experiment five times, and for each run, we measure
the goodputs of flows half an hour into the experiment,
since the network reaches a steady state at this time.

Figure 11: Hop for 30 concurrent flows. Dots on each line
show mean goodput. Median gains of Hop over Hop-by-Hop
TCP and regular TCP are huge (20x and 90x respectively)
while mean gains are modest (roughly 25% improvement).


5.3.1 Goodput 


Figure 11 shows that Hop achieves a huge improvement 
in median goodput over TCP and Hop-by-Hop TCP. Hop 
achieves a median goodput of 54.9 Kbps whereas all the 
other protocols achieve less than 2.8 Kbps—an improve- 
ment of over an order of magnitude! Hop also improves 
the Q1 goodput by more than two orders of magnitude 
and upper quartile goodput by 2x over the other proto- 
cols. The exact numbers of Hop’s median and Q1 gains 
over other protocols are sensitive to environmental con- 
ditions, but we consistently observe them to be large un- 
der different conditions. The figure also shows that Hop- 
by-Hop TCP achieves more than 4x improvement over 
TCP’s median goodput. This shows that end-to-end rate 
control hurts TCP utilization and using hop-by-hop back- 
pressure with TCP improves its performance. We also 
run UDP (not shown for clarity), but due to its lack of
congestion control, around 67% of flows get zero goodput
(i.e., the median is zero) and the mean goodput is 0.32 Kbps.

Hop’s mean gain over TCP is just 25%, which is not
as impressive as the quartile gains. This is to be expected
as TCP is highly unfair and starves a large number of
flows to acquire the channel for only a few flows. In
many cases, the top three TCP flows get around 90% of
the total goodput. In contrast, Hop is significantly fairer
and has higher throughput than most of the TCP flows.


Protocol        | Fairness index
Hop             | 0.78 (0.09)
TCP             | 0.12 (0.04)
Hop-by-Hop TCP  | 0.21 (0.05)

Table 2: Fairness indexes for the 30 flow experiment.
Parentheses show 95% confidence intervals.





5.3.2 Fairness 


Table 2 shows the fairness index for different protocols.
The fairness metric that we use is hop-weighted Jain’s
fairness index (JFI [28]). When there are $n$ flows, with
throughputs $x_1$ through $x_n$ and hop lengths $h_1$ through
$h_n$, it is computed as follows:

$$\mathrm{JFI} = \frac{\left(\sum_{i=1}^{n} x_i h_i\right)^2}{n \sum_{i=1}^{n} (x_i h_i)^2}$$
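
A small sketch of the fairness computation follows. It assumes that the hop-weighted index is obtained by substituting the hop-weighted throughput (throughput times hop length) for the plain throughput in Jain's standard index; the function name and the example values are hypothetical.

```python
from typing import Sequence


def hop_weighted_jfi(throughputs: Sequence[float], hops: Sequence[int]) -> float:
    """Jain's fairness index over hop-weighted throughputs x_i * h_i.

    Assumed form: (sum of x_i*h_i)^2 / (n * sum of (x_i*h_i)^2).
    """
    if len(throughputs) != len(hops) or not throughputs:
        raise ValueError("need one hop length per flow and at least one flow")
    weighted = [x * h for x, h in zip(throughputs, hops)]
    denom = len(weighted) * sum(w * w for w in weighted)
    if denom == 0:
        return 0.0  # every flow is starved; treat as the degenerate case
    return sum(weighted) ** 2 / denom


if __name__ == "__main__":
    # Equal hop-weighted shares give an index of 1.
    print(hop_weighted_jfi([6.0, 3.0, 2.0], [1, 2, 3]))
    # One flow taking everything among four gives 1/4.
    print(hop_weighted_jfi([10.0, 0.0, 0.0, 0.0], [1, 1, 1, 1]))
```

Under this form the index ranges from 1/n, when a single flow captures all of the goodput, to 1, when all hop-weighted shares are equal.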

Hop is significantly fairer than both TCP-based proto- 
cols. It is noteworthy that while TCP sacrifices fairness 
for goodput, Hop is superior on both metrics. 


5.4 Hop performance breakdown 


How much do components of Hop individually con- 
tribute to its overall performance? To answer this ques- 
tion, we compare four versions of Hop: 1) the basic Hop 
protocol that only uses hop-by-hop block transfer, 2) Hop 
with ack withholding turned on, 3) Hop with backpres- 
sure turned on, and 4) Hop with both ack withholding 
and backpressure turned on. Since the impact of these 
mechanisms depends on the load in the network, we con- 
sider 10, 20 and 30 concurrent flows between randomly 
picked sender-receiver node pairs. A static mesh topology
identical to §5.2 was used. The lengths of the randomly
picked paths are between three and seven hops.
The average path length is 3.9 hops in the 10 flow case, 
4 hops in the 20 flow case, and 3.9 hops in the 30 flow 
case. Each flow transmits a large amount of data, and we 
take a snapshot of the measurements after half an hour.

Figure 12: Hop performance breakdown showing contribution
of ack withholding and backpressure. Ack withholding and
backpressure improve Hop’s performance by more than 4.8x
under high load.


Figure 12 shows the performance of the different 
schemes. The benefit of ack withholding and backpres- 
sure increases with network load. In the 10 flow case, 
both ack withholding and backpressure increase goodput 
by around 20%. With greater network load, congestion
increases dramatically, hence the gains due to backpressure
are larger than those due to ack withholding. For example,
in the 30 flow case, Hop with backpressure yields a
3.7x improvement over basic Hop, whereas Hop with
ack withholding yields a 2.5x improvement. Furthermore,
the benefits of using both backpressure and ack withholding
are considerably greater than those of using either one alone.




For instance, the full-fledged Hop yields a 4.8x improvement
over basic Hop for the 30 flow case.


5.5 Hop with WLAN access points 


Next, we evaluate how ack withholding in Hop compares 
to the 802.11 RTS/CTS mechanism for dealing with hid- 
den terminals. We emulate a typical one-hop WiFi net- 
work where a number of terminals connect to a single 
access point. We set up a 7-to-1 topology for this experiment
by selecting a node in the center of our testbed to
act as the “AP node”, and transmitting data to this node
from all its seven neighbors. Among the seven transmitters,
six pairs were hidden terminals (i.e., they could not
reach each other but could reach the AP). We verified
this by checking to see if they could transmit simultane- 
ously without degradation of throughput. In each run, the 
nodes transmit data continually, and we measure goodput 
after 30 minutes when the flow rates have stabilized. 


               | Mean     | Median   | Fairness
Hop            | 663 (24) | 652 (33) | 0.93 (0.01)
TCP + RTS/CTS  |          |          |

Table 3: Mean/median goodput and fairness for a many-to-one
“AP” setting. 95% confidence intervals shown in parentheses.


We compare Hop against TCP both with and without 
802.11 RTS/CTS enabled. The results are presented in 
Table 3, and show that Hop beats TCP with or with- 
out RTS/CTS both in throughput and fairness. While 
the mean gains over TCP without RTS/CTS are only 
12%, the median improvement is about 2.7x. TCP 
has a crafty way of maintaining high aggregate good- 
put amidst hidden terminals by squelching all but one 
of the flows and in effect serializing them. In contrast, 
Hop achieves almost perfectly fair allocation across the 
different flows. The addition of RTS/CTS to TCP hurts 
aggregate throughput but improves median throughput 
and fairness. However, Hop achieves 1.4x the aggregate
throughput and 1.96x the median throughput, in addition to
hugely improving fairness over TCP with RTS/CTS.


5.6 Hop delay for small file transfers 


How does Hop impact the delay incurred by micro- 
blocks (small files)? Recall that Hop uses two mecha- 
nisms to speed micro-block transfers: 1) It piggybacks 
micro-blocks less than 16KB in size with the initial 
BSYN to reduce connection setup overhead, 2) Its ack
withholding mechanism prioritizes micro-blocks.


5.6.1 Single-hop transfer delay for small files 


First, we evaluate the benefits of Hop’s size-aware ack 
withholding policy. To evaluate this, we pick a one-hop 
WiFi network where five nodes are connected to an AP
(a setup similar to our WLAN experiments). In each
experiment, one of the five nodes (randomly chosen)
transmits a micro-block to the AP at a random time, whereas
the other four nodes continually transfer large amounts 
of data. Each experiment runs until the micro-block 
completes, at which point we compute the delay for the 
transfer. We compare against TCP with and without 
RTS/CTS, and report aggregate numbers over five runs. 
Figure 13 shows that the transfer delay of the micro- 
block with Hop is always lower than for TCP (with or 
without RTS/CTS). In many cases, the delay gains are 
significant, e.g., for file sizes less than 16KB, the gains 
range from 3x to 15x. This experiment shows that Hop 
can be used for delay-sensitive transfers like web transfers,
ssh, and SMS in many-to-one AP settings.

Figure 13: Hop for WLAN: Hop improves delay for all file
sizes with improvements between 3-15x.


5.6.2 Multi-hop transfer delay for Web file sizes 


Next, we evaluate Hop and TCP over a larger workload
that consists predominantly of micro-blocks. (We do
not consider TCP with RTS/CTS enabled, since it
consistently introduces more delay.) In particular, we
consider a Web traffic pattern where most files are small web
pages [5]. The flow sizes used in this experiment were
obtained from an HTTP proxy server trace from
the IRCache project [12]. The CDF obtained was sampled
to obtain the representative flow sizes used in this
experiment. The distribution of file sizes is as follows:
roughly 63% of the files are less than 10KB, 25% are
between 10KB and 100KB, and the remainder are greater than
100KB. To stress multi-hop performance, the sender and
receiver for each flow are chosen randomly among the
node pairs that were multiple hops away in our mesh
network. Flows followed a Poisson arrival pattern with
λ = 2 flows per second. We present results from 100
flows aggregated in bins of size $[2^{n-1}, 2^n]$, except the
bins at the edge, i.e., <2KB and >512KB.
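
The workload generation and binning described above can be sketched as follows. This is an illustrative reconstruction under stated assumptions: inter-arrival times are drawn from an exponential distribution (which yields Poisson arrivals), sizes are expressed in KB, and a size that falls exactly on a power-of-two boundary is placed in the lower bin. The function names are hypothetical.

```python
import random
from typing import List


def poisson_arrival_times(
    n_flows: int, rate_per_s: float, rng: random.Random
) -> List[float]:
    """Start times of n_flows flows arriving as a Poisson process."""
    t = 0.0
    times = []
    for _ in range(n_flows):
        t += rng.expovariate(rate_per_s)  # exponential inter-arrival gap
        times.append(t)
    return times


def size_bin(size_kb: float) -> str:
    """Map a file size in KB to its power-of-two bin label."""
    if size_kb < 2:
        return "<2KB"
    if size_kb > 512:
        return ">512KB"
    upper = 4
    while size_kb > upper:  # sizes on a boundary fall into the lower bin
        upper *= 2
    return f"{upper // 2}-{upper}KB"


if __name__ == "__main__":
    starts = poisson_arrival_times(100, 2.0, random.Random(0))
    print(f"last flow starts at {starts[-1]:.1f} s")
    print(size_bin(20))
```

With a rate of 2 flows per second, 100 flows arrive over roughly 50 seconds on average, so many transfers overlap in time.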

Figure 14 shows that Hop has less or comparable delay
to TCP for almost all file sizes except those between
16KB and 32KB. This dip occurs because 16KB is our threshold
for piggybacking data with BSYNs. This suggests
that a slightly larger threshold might be more effective,
but we leave the optimization for future work. For other
bins, delay with Hop is mostly lower than TCP (ranging
from 19% higher to 6x lower than TCP), demonstrating
its benefits for micro-block transfer. Detailed file size
microbenchmarks in isolation (i.e., without concurrent
transfers) show a similar behavior (detailed in [15]).

Figure 14: Performance for web traffic: Except the 32KB bin,
Hop has comparable or better delay, with gains up to 6x.


5.7 Robustness to partitions


A key strength of Hop is its ability to operate even under 
disruptions unlike end-to-end protocols such as TCP. We 
now evaluate how, in a partitioned scenario, Hop com- 
pares to hop-by-hop schemes such as DTN2.5 that are 
designed primarily for disruption-tolerance. In this ex- 
periment, we pick a seven hop path and simulate a par- 
tition scenario by bringing down the third node and fifth 
node in succession along the path for one minute each 
in an alternating manner. Table 4 shows the goodput
obtained by Hop averaged over five runs under two different
backpressure settings: 1) backpressure limit (H) is
set to 1 and 2) backpressure limit is set to 100. Hop
outperforms DTN2.5, a protocol specifically designed for
partitioned settings, by 2x when H = 1, and 3x when 
H = 100. The results show that Hop achieves excellent 
throughput under partitioned settings, and a large back- 
pressure limit improves throughput by about 15%. This 
result is intuitive as having a larger threshold enables 
maximal use of periods of connectivity between nodes. 
In contrast to Hop, TCP achieves zero throughput since 
a contemporaneous end-to-end path is never available. 


Protocol      | Goodput (Kbps)
Hop w/ H=1    | 320 (29)
Hop w/ H=100  | 457 (18)
DTN2.5        | 159 (15)

Table 4: Goodput achieved by Hop and DTN2.5 in a partitioned
network without an end-to-end path.





5.8 Hop with VoIP 


In this experiment, we quantify the impact of Hop and
TCP on Voice-over-IP (VoIP) traffic. We use two metrics:
1) the mean opinion score (MOS) to evaluate the
quality of a voice call, and 2) the conditional loss
probability (CLP) to measure loss burstiness. The MOS value
can range from 1 to 5, where above 4 is considered good
and below 3 is considered bad. The MOS score for a
VoIP call is estimated as in [6]. The CLP is calculated as
the conditional probability that a packet is lost given that
the previous packet was also lost.
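
The CLP definition translates directly into a count over consecutive packet pairs. The sketch below assumes the loss trace is available as one boolean per VoIP packet in send order; that representation and the function name are assumptions for illustration.

```python
from typing import Sequence


def conditional_loss_probability(lost: Sequence[bool]) -> float:
    """P(packet i is lost | packet i-1 was lost) over a per-packet loss trace."""
    prev_lost = 0  # pairs whose first packet was lost
    both_lost = 0  # pairs in which both packets were lost
    for prev, cur in zip(lost, lost[1:]):
        if prev:
            prev_lost += 1
            if cur:
                both_lost += 1
    return both_lost / prev_lost if prev_lost else 0.0


if __name__ == "__main__":
    trace = [True, True, False, True, False]
    print(conditional_loss_probability(trace))
```

A CLP well above the overall loss rate indicates that losses arrive in bursts, which tends to hurt voice quality more than the same number of isolated losses.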

The experiment consists of a single VoIP flow and 
multiple Hop/TCP flows that transmit data continually 
over randomly picked 3-hop paths in the testbed. We em- 
ulate the VoIP flow as a stream of 20 byte packets with 
data rate at 8 Kbps. We evaluate two cases: one VoIP 
flow with five Hop/TCP flows, and one VoIP flow with 
ten Hop/TCP flows. 

Table 5 shows that Hop achieves significantly better 
throughput than TCP (in terms of median/mean) but has 
more impact on the quality of VoIP calls. This is to be ex- 
pected as TCP starves most of its flows as evidenced by 
the abysmal median throughput (1-2 Kbps), and there- 
fore has lower impact on the VoIP flow. In contrast, 
Hop obtains median throughput of a few hundreds of 
Kbps, while sacrificing a little VoIP quality. We believe 
that even this discrepancy can be reduced by exploiting 
802.11e to set larger contention window parameters for
the background queue (e.g., higher backoff), but have not
experimented with this so far.


Load     | Protocol | Goodput (Kbps) | CLP  | MOS
5 flows  | Hop      | Median: 468.5  | 0.37 | 4.12
         | TCP      | Median: 2      | 0.48 | 4.19
10 flows | Hop      | Median: 184    | 0.57 | 3.92
         | TCP      | Median: 1.7    | 0.31 | 4.16





Table 5: Impact of Hop and TCP on VoIP flows. Result shows 
the median/mean goodput, conditional loss probability, and 
MOS for VoIP with 95% confidence intervals in parentheses. 


5.9 Network and link layer dynamics


Our experiments so far were run with static routes and
with a fixed wireless bit-rate. Now, we evaluate the impact
of dynamic routing using OLSR and auto bit-rate
control using the default MadWifi Sample algorithm. We
run TCP under all four combinations of static/dynamic
routes and fixed/auto bit-rate selection. We compare
these to Hop with a fixed bit-rate and static/dynamic
routes. We are unable to evaluate Hop with auto-rate
control as the current implementation of Hop disables
link-layer ARQs that auto-rate control requires to
estimate link quality. As in §5.3, we consider thirty
concurrent long-lived flows between randomly chosen node
pairs, and run the experiment five times.

Figure 15: Hop for 30 concurrent flows under dynamic routing
and auto bit-rate. Dots on each line show mean goodput.
Median gains by Hop with fixed bit-rate are around 4x over TCP
with OLSR and more than 90x over TCP with static routing.


Figure 15 shows that Hop is better than TCP across 
all combinations, with median gains of 4x over the best 
of them. (Hop behaves almost identically with dynamic 
or static routes, therefore we only show the static case in 
the figure.) Surprisingly, we see that the best combination
for TCP is with OLSR and fixed bit-rate. OLSR
significantly improves TCP’s median goodput or fairness,
thereby reducing Hop’s gain over TCP in comparison
to the static case (§5.3). OLSR benefits TCP
as it constantly changes the routing topology with concurrent
TCP flows, which makes high goodput flows
back off and yield transmission opportunities to the
previously low goodput flows. While the constant shuffling of
flows increases TCP’s median goodput, OLSR’s impact
on TCP’s mean goodput is small (25%) because the links
in the network are already heavily loaded. Auto-rate
control makes almost no improvement to TCP since the
testbed remains well-connected at 11 Mbps, and hence
OLSR chooses good links at this bit-rate.


5.10 Hop under 802.11g

Figure 16: Hop for 30 concurrent flows under 802.11g. Dots
on each line show mean goodput. Hop’s median gain is 22x
over TCP with bit-rate fixed at 24Mbps, and is 6x over TCP
with auto-rate control. Hop’s mean gain is 3x over TCP with
auto-rate control.





All of our experiments so far were done with 802.11b.
How does Hop perform under higher bit-rates obtained
using 802.11g? To answer this question, we consider an
experiment similar to that in §5.3 with thirty long-lived
concurrent flows between randomly chosen node pairs.
We use a subset of our testbed (15 nodes) for this experiment
as many nodes get disconnected under 802.11g.
We ran this experiment with a static routing topology
obtained by running OLSR under 802.11g. We consider
Hop and TCP with a fixed 802.11g bit-rate of 24 Mbps
that yields a reasonably connected topology, as well as
TCP with auto-rate control.

Figure 16 shows that Hop improves median goodput
by 6x over TCP with auto-rate control and by 22x over
TCP with fixed bit-rate. The gains over TCP with auto-rate
are lower than in the case of our 802.11b experiments
in §5.3 because the maximum bit-rate in 802.11g
is higher than the selected fixed bit-rate of 24 Mbps.
Thus, TCP with auto-rate control can take advantage of
the fact that the maximum bit-rate on 802.11g links is 54
Mbps, whereas Hop’s bit-rate is fixed at 24 Mbps. As a
result, the highest goodput achieved by a flow that uses 
TCP with auto-rate control is 23 Mbps, which is higher 
than Hop’s maximum goodput of 16 Mbps. The fact that 
Hop shows considerable benefits despite using a static 
best bit-rate suggests that Hop with a good bit-rate selec- 
tion scheme can benefit even more. 

Figure 16 also shows that auto-rate control improves 
TCP’s fairness (median goodput increases by 3.2x) but
hurts network utilization (mean goodput decreases by 
65%). This is because auto-rate improves the low good- 
put flows over lossy links by reducing the bit-rate (and 
thereby the loss rate), but impacts high goodput flows 
as flows over low bit-rate links are slow and consume a 
large portion of transmission opportunities. 


5.11 Discussion: Hop vs. TCP


Although the above results show Hop’s benefits across a 
wide range of scenarios, our evaluation has some limi- 
tations. First, our results are based on a 20-node indoor 
testbed, so we cannot claim that they will hold in other
wireless mesh networks. For example, it is conceivable 
that the benefits due to ack withholding are because of 
hidden terminals specific to our testbed’s topology. Nev- 
ertheless, our experience with Hop has been encourag- 
ing. Over the last few months, we have experimented 
with different node placements, static topology configu- 
rations, and diurnal as well as seasonal variations in cross 
traffic and channel conditions, and have seen results con- 
sistent with those described in this paper. Second, we 
have not compared Hop to a large number of proposed 
TCP modifications for multi-hop wireless networks for 
which implementations are not available (refer to §6.1). We
present Hop as a simple and robust alternative to end-to-end
rate control schemes, but do not claim that end-to-end
rate control cannot be fixed to obtain comparable
benefits at least in well-connected environments.

TCP’s strengths are undeniable. Under high load, it is
difficult to outperform TCP significantly in terms of
aggregate throughput (refer to Figures 11 and 16). TCP backs
off aggressively on bad paths reducing contention for 
flows on good paths resulting in an efficient but unfair al- 
location. TCP has a similar effect on hidden terminals— 
by squelching most of the colliding flows, TCP in effect 
unfairly serializes them but ensures high throughput. Fi- 
nally, despite its many woes in wireless environments, 
TCP enjoys the luxury of experience through widespread 
deployment, setting a high bar for alternate proposals. 

Hop is not designed to be TCP-friendly. For exam- 
ple, in the 30 flow scenario, if we convert just 7 of the 
30 TCP flows to use Hop instead of TCP, the median 
goodput of the remaining 23 drops by an order of mag- 
nitude [15]. This is unsurprising as Hop’s bursty traffic 
increases the loss and contention perceived by TCP flows 
causing them to aggressively back off. 


6 Related work 


Wireless transport, especially the performance and fairness
of TCP over 802.11, has seen a large body of prior
work. Our primary contribution is to draw upon this
work and show that reliable per-hop block transfer is a 
better building block for wireless transport through the 
design, implementation, and evaluation of Hop. 


6.1 Proposed alternatives to TCP 


TCP performance: TCP’s drawbacks in wireless net- 
works include its inability to disambiguate between con- 
gestion and loss [2], and its negative interactions with 
the CSMA link layer. Proposed solutions include: 1) 
end-to-end approaches that try to distinguish between 
the different loss events [25], attempt to estimate the 
rate to recover quickly after a loss event [19], or re- 
duce TCP congestion window increments to be fractional 
[21], 2) network-assisted approaches that utilize feed- 
back from intermediate nodes, either for ECN notifica- 
tion [38], failure notification [17] or for rate estimation 
[32], and 3) link-layer solutions that use a fixed win- 
dow TCP in conjunction with link-layer techniques such 
as neighborhood-based Random Early Detection ([9]) or 
backpressure flow control (RAIN [16]) to prevent losses 
due to link queues filling up. 

TCP fairness: TCP unfairness over 802.11 stems pri- 
marily from: 1) excess time spent in TCP slow-start, 
which is addressed in [32] by use of better rate estimation,
and 2) interactions between spatially proximate
interfering flows, which are addressed in [37, 29] by using
neighborhood-based random early detection and rate
control techniques.

In comparison to the above schemes, Hop does not 
rely on end-to-end rate control, and thereby eliminates 
the complex interaction between TCP and 802.11 that is 
the root of its performance and fairness problems. Instead,
Hop uses simple mechanisms (batching, hop-by-hop
backpressure, and ack withholding) to improve performance
as well as fairness. Hop requires no modifications
to the 802.11 MAC protocol.


6.2 Implemented alternatives to TCP 


Few implemented alternatives to TCP are available for 
reliable transport in 802.11 networks today. At the time 
of writing, we found only two such implementations— 
TCP Westwood+ and DTN2.5—both of which we com- 
pare against Hop. Hop’s use of hop-by-hop reliability 
and backpressure is similar to a recent proposal, CXCC 
[31], but differs in its use of burst-mode, ack withhold- 
ing, virtual retransmissions, etc. We could not compare 
Hop against CXCC as it is not implemented for 802.11. 

Two recent systems, WCP [30] and Horizon [27], 
also address TCP’s performance and fairness problems 
over 802.11. WCP, similar in spirit to NRED [37], 
augments TCP’s end-to-end rate control with network- 
assisted feedback about contention along the path. WCP 
shows significant gains in median throughput (or fair- 
ness) under load, but often reduces the mean through- 
put considerably. Horizon uses backpressure scheduling 
with multi-path routing as a shim between unmodified 
TCP and 802.11 layers, and shows improved fairness un- 
der load in a majority of experimental runs at the cost 
of mean throughput. In comparison, Hop consistently 
shows significant improvement in fairness and mild
improvement in mean throughput under load. Although we
have not performed a head-to-head comparison against them,
we note that both WCP and Horizon rely on link-layer
ARQ per frame, which our experiments (Figures 7 and 10)
suggest is inefficient for lossy wireless links.


6.3 Other related work


Backpressure: Backpressure was first investigated in 
ATM [24] and high-speed networks [20] to handle data 
bursts. A seminal paper by Tassiulas and Ephremides 
[33] showed that backpressure scheduling can achieve 
the stable capacity region of a wireless network. This 
paper sparked off a large body of theoretical work [34] 
on optimal scheduling, routing, and flow control in wire- 
less networks. However, backpressure scheduling is NP- 
hard, incurs a high signaling overhead per transmission, 
and is difficult to implement with the 802.11 MAC layer, 
so few practical implementations exist. 

In recent times, backpressure-like ideas have been 
adapted for congestion control as an alternative to TCP 
[31] or underneath TCP [16, 27]; for unreliable hierar- 
chical data aggregation in sensor networks [11]; for reli- 
able bulk transport in linear sensor networks and a single 
flow [14], etc. In comparison, Hop performs backpressure
over blocks to amortize the signaling overhead, uses
ack withholding to alleviate hidden terminal losses,
and uses per-hop reliability with virtual retransmissions
to efficiently deal with in-network losses.

Batching: Ng et al. [22] show that adapting the burst
size of txops in 802.11e to the load can improve TCP
fairness in WLAN settings. WildNet [26] leverages
batching with FEC and bulk acknowledgments at the
link layer over long-distance unidirectional 802.11 links.
Kim et al. [35] aggregate TCP frames using the 802.11n
burst mode to amortize the MAC protocol overhead. In
comparison, Hop jointly leverages batching both at the 
link and transport layers. 


7 Conclusions 


The last decade has seen a huge body of research on
TCP’s problems over wireless networks, but TCP for
good reasons continues to be the dominant real-world
alternative today. One reason may be that TCP is good
enough in the common case of wireless LANs, and
solutions proposed for more challenged environments do
not perform well in the common case. A natural question
is whether we can have one simple transport protocol that
yields robust performance across diverse networks such
as WLANs, meshes, MANETs, sensornets, and DTNs. 
Our work on Hop suggests that this goal is achievable. 
Hop achieves significant throughput, fairness, and delay
gains both in well-connected WLANs and mesh networks
as well as disruption-prone networks.


Abstract


Despite many attempts to fix it, the Internet’s interdo- 
main routing system remains vulnerable to configuration 
errors, buggy software, flaky equipment, protocol oscil- 
lation, and intentional attacks. Unlike most existing so- 
lutions that prevent specific routing problems, our ap- 
proach is to detect problems automatically and to iden- 
tify the offending party. Fault detection is effective for a 
larger class of faults than fault prevention and is easier to
deploy incrementally. 

To show that fault detection is useful and practical, we 
present NetReview, a fault detection system for the Bor- 
der Gateway Protocol (BGP). NetReview records BGP 
routing messages in a tamper-evident log, and it enables 
ISPs to check each other’s logs against a high-level de- 
scription of the expected behavior, such as a peering 
agreement or a set of best practices. At the same time, 
NetReview respects the ISPs’ privacy and allows them to 
protect sensitive information. We have implemented and 
evaluated a prototype of NetReview; our results show 
that NetReview catches common Internet routing prob- 
lems, and that its resource requirements are modest. 


1 Introduction 


Global Internet connectivity is the result of a competitive 
cooperation of tens of thousands of Autonomous Sys- 
tems (ASes) using the Border Gateway Protocol (BGP). 
Unfortunately, interdomain routing is plagued with many 
serious problems: BGP is hard to manage, and BGP mis- 
configurations and software bugs can create severe net- 
work disruptions [8, 24, 37]. Equipment failures in one 
AS can cause route flapping and trigger excessive routing 
announcements in ASes many hops away [35]. The inad- 
vertent configuration of conflicting routing policies in a 
collection of ASes can lead to persistent oscillation [14]. 
An adversary that controls a BGP-speaking router can 
intentionally ‘hijack’ another AS’s address block in order 
to discard the data packets, snoop on the traffic, imper- 
sonate the legitimate destination, or send spam [25, 27]. 

Many (but not all) of these problems are rooted in the 
absence of a mechanism to verify routing information. 
BGP essentially allows anyone to announce any route, 
whether that route actually exists or not. Hence, there 
has been a lot of work on securing BGP. However, most 
of this work focuses on fault prevention, that is, mask- 
ing routing problems by suppressing invalid route an- 
nouncements. This approach is effective against many 
common problems, but it cannot prevent other, equally 
common faults; for example, an ISP might fail to an- 
nounce a route because of an incorrect export filter. Ex- 
isting security extensions to BGP, such as S-BGP [22] 
and soBGP [34], are not effective against such faults. 
Moreover, existing fault prevention systems require sig- 
nificant buy-in before they can yield much benefit, and 
they require an Internet-wide public-key infrastructure 
(PKI); for these and other reasons, prevention systems 
have not yet achieved widespread deployment. 


In this paper, we take a different and complementary 
approach, namely fault detection. If we cannot prevent 
every routing problem, why not at least ensure that each 
problem is detected and linked to the ISP that caused it? 
Fault detection is easy to deploy incrementally: it does 
not require a central PKI or cryptography on the criti- 
cal path, and it yields benefits even when the deployment 
consists of just a few ISPs (or even a single ISP). More- 
over, if we accept the possibility of some delay between 
the occurrence of a fault and its detection, we can catch a 
very general class of faults, including router and link fail- 
ures, software bugs, misconfigurations, policy violations, 
and even attacks by hackers or spammers. In particular, 
we can detect faults that would be difficult or impossible 
to prevent, e.g., when a faulty or misconfigured router 
fails to propagate certain routes. 


Fault detection has two main benefits. The first (and 
most obvious) benefit is that ISPs are automatically in- 
formed about routing problems and their causes, which 
enables them to respond quickly. Thus, ISPs no longer 
have to rely on monitoring heuristics or customer complaints 
to find out about problems, which increases customer 
satisfaction and enables ISPs to swiftly respond 
even to minor problems. Also, since detection links 
faults to their causes, ISPs no longer need to diagnose 
faults manually. Finally, ISPs obtain a ‘safety net’ that 
enables them to respond to unexpected problems. 

The second, more indirect benefit of fault detection is 
that it makes an ISP’s reliability transparent. Today, ISPs 
may have little to gain from pushing reliability beyond 
a certain point, since customers cannot easily attribute 
a given routing problem to a particular ISP. Fault de- 
tection is an opportunity for reliable ISPs to showcase 
their good performance and to distinguish themselves 
from the competition, which could help them attract new 
customers. In the long term, this could even result in a 
market for reliability, in which customers could directly 
compare the routing performance of potential providers. 

At first, fault detection may appear to be a simple mat- 
ter of keeping logs of all routing messages and inspect- 
ing them (perhaps even manually) for routing problems. 
However, the problem is complicated by several unique 
aspects of the interdomain routing system. First, detect- 
ing certain types of faults requires that ISPs share infor- 
mation, because the fault cannot be detected based on 
one ISP’s view of the network alone. However, ISPs wish 
to minimize the amount of information they release to 
their competitors. Thus, a detection system must balance 
its detection power against the scope of the information 
ISPs need to release. Second, the amount of log data col- 
lected is so vast that manual inspection is out of the ques- 
tion, except in the most egregious cases. Third, the logs 
may be incomplete or even incorrect, not least because 
the routing system is often attacked by hackers who may 
try to manipulate records in order to cover their tracks. 
Finally, if the information about faults is to be used as a 
measure of reliability, we must avoid both false positives 
and false negatives, which rules out heuristic solutions. 

To demonstrate that fault detection is viable, we 
present NetReview, a system that implements fault de- 
tection for BGP. NetReview reliably and automatically 
detects routing problems by checking secure traces of 
BGP messages against high-level specifications of the 
expected routing behavior. NetReview respects the ISPs’ 
privacy and provides strong guarantees: it does not pro- 
duce false positives or false negatives even when under 
attack by a Byzantine adversary. Using a prototype im- 
plementation of NetReview, we show that its resource 
requirements are modest, and that it is effective against 
common Internet routing problems. 

Existing work on securing interdomain routing has 
proven difficult to deploy. A natural question is whether 
a fault detection system would be hampered by similar 
problems. To address this concern, we show that 
NetReview can overcome common deployment hurdles: it 
can work with existing router hardware, it does not 
require a global PKI, it can be deployed incrementally, and 
it offers immediate benefits to early adopters. 


The rest of this paper is structured as follows. In Sec- 
tion 2, we begin by giving some background on BGP, 
and we discuss the specific challenges of BGP fault de- 
tection. In Sections 3 and 4, we present the design of 
NetReview and its specification language. In Section 5, 
we report results from a feasibility study to show that 
fault detection is practical. In Section 6, we present so- 
lutions to various deployment-related problems, such as 
operation in a partial deployment or without a CA, and 
we point out incentives for adoption by ISPs. In Sec- 
tion 7, we describe some advanced features that could be 
added to NetReview. Section 8 discusses related work, 
and Section 9 concludes this paper. 


2 Background 


2.1 Interdomain routing with BGP 


The Internet consists of independent administrative en- 
tities called autonomous systems (ASes). An AS usu- 
ally corresponds to a network run by an Internet Service 
Provider (ISP), although some large ISPs have multiple 
ASes. Each AS is assigned a unique AS number (ASN); 
in 2008, about 40,000 ASNs were in active use. In ad- 
dition, each AS owns a set of IP addresses, which it can 
assign to its hosts and routers. Usually, ASes use large 
contiguous sets of addresses that share a common prefix; 
for example, the prefix 128.42.0.0/16 covers all IP 
addresses whose first two octets are 128 and 42. 
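
As a small illustration of prefix coverage, the check can be written with Python's standard `ipaddress` module. The helper name `covers` is ours and is used only for this sketch.

```python
import ipaddress


def covers(prefix, address):
    """Return True if the IP address falls inside the given CIDR prefix."""
    return ipaddress.ip_address(address) in ipaddress.ip_network(prefix)


# 128.42.0.0/16 fixes the first two octets (128 and 42).
inside = covers("128.42.0.0/16", "128.42.17.5")    # first two octets match
outside = covers("128.42.0.0/16", "128.43.0.1")    # second octet differs
```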


To exchange routing information with each other, all 
ASes use the Border Gateway Protocol [28]. Each AS 
designates some of its routers as BGP speakers, which 
are then connected to BGP speakers in adjacent ASes. 
When a BGP speaker learns of a route to a new prefix, it 
can announce that route to its peers in adjacent ASes; if 
the route becomes unavailable later, it must withdraw the 
announcement. BGP is a path-vector protocol, that is, 
each announcement contains the sequence of ASes that 
the route traverses in an attribute called AS_PATH. 


BGP specifies a mechanism for exchanging routing 
information. Which routes to use and whether or not 
to announce them to peers is decided independently by 
each AS according to its own policy; for example, an AS 
might prefer short routes to reduce latency. Some aspects 
of the policy are determined by an AS’s business rela- 
tionships; for example, an AS might agree to act as the 
provider for another AS, and it would then be expected 
to offer its customer a route to every prefix it can reach. 
Adjacent ASes usually sign a peering agreement, which 
specifies the obligations of each peer. 


2.2 What is a BGP fault? 


The specification of BGP in RFC 4271 [28] describes a 
message format and a few basic rules; everything else 
is left to the implementation and the policy of an AS. 
Therefore, we use a very generic definition of a BGP 
fault. Suppose we have a complete message trace $M_a$ 
of all BGP messages a given AS a has sent or received 
over time (both internally and to/from its peers). Then 
we simply assume that there is a deterministic function 
$F_a(M_a, t)$, and we say that AS a is faulty at time t if and 
only if $F_a(M_a, t)$ = true; otherwise we say that AS a 
is correct at time t.¹ Note that $F_a$ is specific to AS a; a 
different AS b could have a different function $F_b$. 

How can such a function $F_a$ be defined? There are 
several sources of information that can be used for this 
purpose (of course, multiple sources can be combined): 


• RFC 4271: The AS is faulty if it violates the BGP 
specification, e.g., by sending a malformed mes- 
sage, or by announcing a path that contains a loop. 


• ASN and prefix assignment: The AS is faulty if 
it uses a foreign AS number, or if it announces a 
prefix it does not own. 


• BGP best practices: The AS is faulty if it does not 
follow current best practices, e.g., by failing to ag- 
gregate prefixes correctly. 


• Peering agreements: The AS is faulty if it does not 
honor the peering agreements it has negotiated with 
its peers, e.g., by failing to export its customers’ 
routes, or by choosing a route through an AS it has 
promised to avoid. 


• Connectivity: The AS is faulty if it fails to offer 
routes to certain prefixes, e.g., because an internal 
link or equipment failure has caused a partition. 


• Internal goals: The AS is faulty if its routers fail to 
achieve some goal the AS has set for itself, e.g., by 
choosing an expensive route over a cheaper one due 
to a configuration error. 
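
The first two sources above can be combined into a fault function along the following lines. This is only a minimal sketch: the `Announcement` record, the helper names, and the choice to evaluate all messages sent up to time t are our own simplifying assumptions and are not part of NetReview.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Announcement:
    """One logged BGP announcement (hypothetical, simplified record)."""
    time: float                 # when the message was sent
    sender_asn: int             # AS that sent the message
    prefix: str                 # announced prefix, e.g. "128.42.0.0/16"
    as_path: Tuple[int, ...]    # AS_PATH, sender first, origin last


def collapse_prepending(as_path):
    """Drop consecutive repeats, which are legitimate AS_PATH prepending."""
    collapsed = []
    for asn in as_path:
        if not collapsed or collapsed[-1] != asn:
            collapsed.append(asn)
    return tuple(collapsed)


def has_loop(as_path):
    """True if some AS appears twice with a different AS in between."""
    collapsed = collapse_prepending(as_path)
    return len(set(collapsed)) != len(collapsed)


def make_fault_function(asn, owned_prefixes):
    """Build a deterministic F_a(M_a, t) for the AS with number `asn`."""
    def fault(trace, t):
        for ann in trace:
            # Only messages sent by this AS up to time t are considered.
            if ann.sender_asn != asn or ann.time > t:
                continue
            # RFC 4271 source: a path that contains a loop is a fault.
            if has_loop(ann.as_path):
                return True
            # Prefix assignment source: originating a foreign prefix is a fault.
            originates = collapse_prepending(ann.as_path) == (asn,)
            if originates and ann.prefix not in owned_prefixes:
                return True
        return False
    return fault
```

Because the result depends only on the trace and on t, two auditors who hold the same trace reach the same verdict.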


Note that our definition does not say who defines $F_a$ 
and who evaluates it; we will address these challenges 
in Section 3, and we will show which information needs 
to be shared to ensure that faults are detected. Also, our 
definition does not imply that there is a unique correct 
message trace for each AS. For example, if an AS is of- 
fered multiple routes to a given prefix and its policy does 
not prefer any route in particular, it can choose any route. 

According to our definition, each fault is local to a single 
AS. Thus, if a faulty AS a exports a bad route to a 
neighbor b, b does not become faulty for propagating the 
route, except if propagating the route constitutes a fault 
according to its own function $F_b$. A special case occurs 
when a link between two neighboring ASes fails. Since 
the link is shared by two ASes, we cannot attribute this 
event to an individual AS, so we attribute it to the pair of 
ASes instead. 


¹A similar definition can be used for router-level faults. We focus 
on AS-level faults because they are more general. 


2.3 Challenges in BGP fault detection 


To illustrate the challenges in building a practical BGP 
fault detection system, we first consider a simple straw- 
man implementation of fault detection that works as fol- 
lows. Every ISP enables full logging on all their routers 
and periodically uploads the logs to a central server, to- 
gether with a description of their peering agreements and 
internal goals. Because the central server has full information, 
it can reconstruct the message trace $M_a$ for each 
AS a, and it can evaluate $F_a$ for any (past) point in time. 
This solves the fault detection problem because the cen- 
tral server can eventually detect any BGP fault, no matter 
how complex it is. 

However, there are several reasons why this strawman 
solution would not work in practice: 


• Privacy: The strawman’s logs contain sensitive 
information that ISPs would not agree to reveal to a 
third party, such as their routing policy and internal 
topology. A practical system must protect the ISPs’ 
business secrets while retaining its detection power. 


• Reliability: The information in the strawman’s logs 
is not necessarily accurate: routers can malfunction, 
and hackers can tamper with the logs to conceal an 
attack. A practical system must ensure that no faults 
go undetected, even when it is under attack. 


• Automation: Collecting and processing the vast 
amounts of trace data could prove expensive. A 
practical system must be able to efficiently check 
this data without manual intervention. 


• Decentralization: It is unlikely that ISPs around 
the world would accept and trust a single fault de- 
tector entity. A practical system must not introduce 
any new trusted entities or require ISPs to coordi- 
nate with ISPs they do not already cooperate with. 


• Deployability: The strawman assumes global 
deployment. A practical system must have a clear 
deployment path, with immediate benefits for early 
adopters and a migration path for legacy equipment. 


A fault detection system for BGP should address these 
five challenges. 

[Figure 1 diagram. Legend: BGP speaker, internal router, tamper-evident logs; messages are marked as recorded or not recorded.] 


Figure 1: System model. Each BGP speaker main- 
tains a tamper-evident log of the BGP messages it ex- 
changes with other ASes. Internal routing messages are 
not recorded. 


3 NetReview 


To demonstrate that the above challenges can be ad- 
dressed in a practical system, we now present a detection 
system called NetReview. For clarity of presentation, we 
initially assume that NetReview is deployed universally, 
and that the allocation of ASNs and IP prefixes to ASes is 
certified by a trusted certification authority (CA). In Sec- 
tion 6, we describe solutions for partial and incremental 
deployment, and we show how NetReview can be used 
without a CA. 


3.1 Overview 


At a high level, each BGP speaker maintains a log of all 
the BGP messages it sends and receives (Figure 1). In 
addition, each AS states a set of rules that describe best 
practices, routing policies, etc. that the AS adopts (the 
union of these rules specifies $F_a$ and thus defines what 
constitutes a fault; the rules do not necessarily describe the entire 
routing policy of the AS). Both the logs and the rules are 
then made available to certain other ASes, who can audit 
them to check whether the rules have been followed. If a 
rule was broken, NetReview guarantees that at least one 
auditor can detect this and obtain verifiable evidence of 
the fault, which it can then use to convince third parties. 

NetReview only records BGP messages that are ex- 
changed with other ASes, but no internal routing mes- 
sages. Thus, the log only contains information that an 
AS would reveal to other ASes anyway; the ISP’s pro- 
prietary information, such as its internal topology, is not 
revealed. In addition, each ISP is free to decide which 
rules it wants to reveal to each auditor. For example, an 
ISP might choose to reveal its best-practice rules to ev- 
eryone, and, in addition, it might reveal to each of its 
business partners a set of rules that describes its policy 
towards that partner. This is safe because the partner 
already knows that aspect of the policy from the peering 
agreement. 

NetReview uses cryptographic authenticators [17] to 
detect if routing messages are not logged correctly. The 
log itself is tamper-evident, that is, it can detect if log 
entries are modified after the fact. Thus, NetReview can 
guarantee that log corruption — due to software bugs or 
hardware malfunctions — cannot cause faults to go un- 
detected. This guarantee holds even in the presence of 
Byzantine faults, e.g., when hackers or spammers at- 
tempt to cover up the traces of an attack. 

NetReview includes a simple specification language 
for writing rules. The resulting rules can be checked 
efficiently; we show that a commodity workstation is 
sufficient to audit several ASes in real time. 

NetReview is designed to leverage existing trust and 
business relationships between neighboring ASes. We 
consider two ASes to be neighbors if they are connected 
by a direct link. 


3.2 Assumptions and guarantees 


NetReview’s design relies on the following assumptions: 


1. Each AS has at least one diligent neighbor. By 
diligent, we mean that this neighbor regularly audits 
the AS and collects evidence. This is a reasonable 
assumption because ASes have a natural interest in 
learning about routing problems of their neighbors. 


2. Each AS is willing to publish a list of its neighbors. 
Knowing the nature of the business relationships 
is not necessary, just the fact that two ASes 
are connected. This is a reasonable assumption, be- 
cause the information can already be determined us- 
ing tools like traceroute or RouteViews [30]. 


3. Each AS can eventually send control messages 
to any other AS. This property holds for the Inter- 
net because the AS graph is connected, and because 
link failures are repaired in a timely fashion (that is, 
within at most a few days). 


4. No attacker can invert the hash function or 
break cryptographic keys. This is a common as- 
sumption for protocols that rely on cryptography. 


Note that NetReview is not subject to the limitations 
for Byzantine fault tolerance techniques, such as the need 
for 3f + 1 replicas to tolerate f faults. Fault detection is 
an easier problem, so this bound does not apply. 

NetReview focuses on detecting observable faults, 
that is, faults that a) causally affect at least one non-faulty 
AS [16], and b) violate a rule that is revealed to at least 
one diligent AS. This restriction is inevitable because we 
cannot expect faulty routers to help with fault detection. 
An example of an unobservable fault would be two faulty 
routers sending bad routing updates to each other, but 
neither of them logging the messages or forwarding the 
authenticators to the other’s neighbors. Such a fault can- 
not be detected as long as it does not affect a correct AS. 
Under the above assumptions, NetReview guarantees 
that a) any observable fault is eventually detected and 
irrefutably linked to a faulty AS, and that b) no verifiable 
evidence is ever generated against a non-faulty AS. 


3.3 Maintaining tamper-evident logs 


In NetReview, each border router maintains a log of all 
routing messages it has sent to, or received from, a router 
in another AS. In addition, the logs contain periodic 
checkpoints of the BGP routing tables, as well as a hash 
of each rule the AS has adopted. This additional infor- 
mation is needed for auditing and will be discussed in 
Sections 3.8 and 3.9, respectively. 

NetReview’s logs are based on the logs in PeerRe- 
view [17]. The logs are tamper-evident, that is, a router 
either records precisely the messages it has exchanged 
with other routers, or it is possible to detect that the router 
is faulty. Note that, since our goal is fault detection, we 
do not need to prevent faulty routers from tampering with 
their logs — being able to detect tampering is sufficient 
because it is clear evidence of a fault. Specifically, 
NetReview detects if a router (i) records a message it did 
not send or receive, (ii) omits a message it did send or 
receive, (iii) changes an existing log entry, or (iv) keeps 
multiple logs or a branched log. For lack of space, we 
only sketch the most important aspects of the log here. 
Please refer to [17] for a complete description. 

Operation: Each log is structured as a hash chain, 
i.e., every entry $e_i$ is associated with a sequence number 
$s_i$ and a hash value $h_i$ that covers the entry itself and, 
transitively, all the previous entries. To explain the protocol 
for logging message exchanges, we use the example 
of two routers, Alice and Bob. Whenever Alice sends a 
message m to Bob, Alice first appends a SEND(m) entry 
to her log and then attaches an authenticator to m, which 
is a signed statement that Alice has logged the transmission 
of m. The authenticator $a_i = \sigma_{Alice}(s_i, h_i)$ for an 
entry $e_i$ includes the entry’s value in the hash chain, $h_i$, 
and is signed with Alice’s cryptographic key $\sigma_{Alice}$. The 
authenticator has two purposes: first, it convinces Bob, 
and any auditors of Bob’s log, that the message is authentic, 
which rules out (i). Second, it serves as evidence that 
a SEND(m) entry must appear in Alice’s log, which addresses 
case (ii) and, because of the hash chain, case (iii). 
When the message m arrives, Bob appends a RECV(m) 
entry to his log and then returns an acknowledgment to 
Alice, which includes an authenticator for the RECV(m) 
entry. At this point, both Alice and Bob have obtained 
evidence that the other side has properly recorded the 
message in their log. 
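
The hash chain and the authenticators can be illustrated with a short sketch. Several details here are our assumptions rather than NetReview's actual format: the hash function (SHA-256), the byte layout of each chain input, the all-zero base hash, and above all the use of an HMAC as a stand-in for a public-key signature. A real deployment needs public-key signatures so that third parties can verify authenticators without knowing the signer's secret.

```python
import hashlib
import hmac

ZERO_HASH = b"\x00" * 32  # assumed base value that precedes the first entry


def chain_hash(prev_hash, seq, entry_type, payload):
    """h_i covers entry i and, through prev_hash, all earlier entries."""
    body = (prev_hash + seq.to_bytes(8, "big") + entry_type.encode()
            + b"\x00" + hashlib.sha256(payload).digest())
    return hashlib.sha256(body).digest()


def sign(key, seq, h):
    """Stand-in for a signature over (s_i, h_i); HMAC is used for brevity."""
    return hmac.new(key, seq.to_bytes(8, "big") + h, hashlib.sha256).digest()


class Log:
    """Append-only log whose entries are linked by a hash chain."""

    def __init__(self, key):
        self.key = key
        self.entries = []  # tuples (seq, entry_type, payload, chain hash)

    def append(self, entry_type, payload):
        """Append an entry and return its authenticator (s_i, h_i, tag)."""
        prev = self.entries[-1][3] if self.entries else ZERO_HASH
        seq = len(self.entries) + 1
        h = chain_hash(prev, seq, entry_type, payload)
        self.entries.append((seq, entry_type, payload, h))
        return seq, h, sign(self.key, seq, h)


def chain_is_intact(entries):
    """Recompute the chain; any modified entry breaks a later hash."""
    prev = ZERO_HASH
    for seq, entry_type, payload, h in entries:
        if chain_hash(prev, seq, entry_type, payload) != h:
            return False
        prev = h
    return True


def authenticator_is_valid(key, authenticator):
    """Check the tag on an authenticator (s_i, h_i, tag)."""
    seq, h, tag = authenticator
    return hmac.compare_digest(tag, sign(key, seq, h))
```

In this sketch, Alice would call `append("SEND", m)` and attach the returned authenticator to m; Bob would call `append("RECV", m)` on his own log and return the resulting authenticator in his acknowledgment.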

NetReview imposes a limit on the number of unac- 
knowledged messages that can be in flight between Alice 
and Bob at any given time. If this limit is reached, e.g., 
during an unplanned physical link failure or because Bob 
refuses to send acknowledgments, the operators are no- 
tified and must resolve the problem by leveraging their 
existing business relationship. 

What if Alice or Bob log the message at first but modify 
or remove it later? When Bob receives the authenticator 
from Alice, he detaches it from the message (to save 
bandwidth) and forwards it to Alice’s neighbors. Thus, 
Alice’s neighbors eventually learn of all log entries for 
which Alice issued authenticators. Each neighbor pe- 
riodically inspects Alice’s log to check whether these 
entries actually appear. If an authenticator is properly 
signed but the corresponding entry is missing, then Alice 
must have tampered with the log, maintained multiple 
logs or a log with multiple branches, and the authentica- 
tor is a signed confession. This addresses (iv). 

Protocol support: NetReview extends BGP with sup- 
port for authenticators and acknowledgments. To limit 
the crypto overhead during bursts of updates, it also in- 
troduces a new composite message that allows multiple 
updates to be covered by a single authenticator (and thus 
by a single signature). We call this protocol variant BGP 
with acknowledgments, or BGP-A. 

Log truncation: Routers require some storage for 
keeping the log. This storage does not have to be in the 
router itself — it could be on a separate blade, or on an- 
other computer — but capacity is limited, and log entries 
cannot be stored indefinitely. Therefore we allow routers 
to discard entries that are older than some time $T_{max}$, 
e.g., one year. Since the log contains periodic snapshots 
of the routing tables, discarding old entries does not de- 
stroy information about long-lived routes. 

For routers to agree when $T_{max}$ elapses, clocks must 
be loosely synchronized, e.g., within a few hours. Net- 
Review enforces this by checking the timestamps on the 
authenticators. If a router’s clock is not set properly, its 
messages will not be accepted by the adjacent routers. 

If a log entry were not audited at least once during 
its lifetime, some faults could remain undetected. How- 
ever, the typical audit period can be expected to be much 
shorter than the lifetime of log entries because ASes are 
likely to be interested in timely fault detection. 


3.4 Auditing 


To ensure that no fault goes undetected, the logs of each 
AS must be inspected regularly. Technically, it is pos- 
sible to allow each AS to audit any other AS; however, 
NetReview requires only that each AS audit the logs of 
its neighbors. Neighbors have a natural incentive to learn 
about each other’s routing problems, and because of their 
existing business relationships, they are in a good posi- 
tion to take action if a problem is discovered. Also, re- 
call our assumption that each AS has at least one diligent 
neighbor; this ensures that each log entry is properly in- 
spected at least once. 

To inspect an interval $I := [t_1, t_2]$ of a target’s log, the 
auditor proceeds as follows: 


1. If the auditor is not a neighbor of the target, it asks 
the target’s neighbors for authenticators from inter- 
val I. 


2. The auditor asks each of the target’s border routers 
for a set of rules² and a signed segment of its log 
that covers interval I. 


3. The auditor checks whether the following properties 
hold for the set of logs it has obtained: 


• Consistency: All authenticators match an entry 
in one of the logs. 


• Conformance: The sequence of messages in 
each log conforms to BGP-A. 


• Compliance: The target has followed each of 
the rules it has revealed. 


3.5 Extracting evidence 


When an auditor discovers an interval $I' := [t_1', t_2'] \subseteq I$ 
for which one of the above properties does not hold, it 
extracts the corresponding log segment, starting at the 
most recent snapshot. Then it removes all entries that 
are not essential for checking the property (such as addi- 
tional snapshots), as well as any parts of the first snapshot 
that are not needed to replay this particular segment. The 
result is a compact data structure that irrefutably ties the 
fault to the cryptographic key of the responsible AS, and 
thus (via the certificate) to its principal. This data can be 
used as evidence of the fault, and a third party can verify 
it independently without having to repeat the audit. 

Once an auditor has obtained evidence, it notifies the 
local administrator, who can use the evidence in sev- 
eral ways. For example, if a best-practice rule has been 
violated, the auditor can choose to make the evidence 
publicly available; thus, it is possible to evaluate an 
ISP’s performance by asking its neighbors for evidence 
of faults. If a private rule was broken, the evidence can 
be used to convince an arbitrator or a judge. 


²In Section 3.9, we describe how the auditor can verify that the rules 
are genuine. 


3.6 Consistency and conformance checks 


The consistency check detects if the target AS has tam- 
pered with its log. Recall that each BGP-A message or 
acknowledgment contains a signed authenticator that is 
linked to a specific log entry, and thus to a specific point 
in the hash chain. If the target has returned a valid log 
segment, it will be consistent with all the authenticators; 
otherwise the log segment and the mismatched authenti- 
cator constitute a proof of misbehavior. Since neighbors 
collect each other’s authenticators, and since we assume 
that each AS has at least one diligent neighbor, we know 
that any forged, omitted, or modified log entry is eventu- 
ally detected by at least one neighbor. 
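
At its core, the consistency check is a lookup of each collected authenticator in the returned log segment. The sketch below assumes that the signatures on the authenticators have already been verified, that all authenticators fall within the audited interval, and that a log segment can be reduced to (sequence number, chain hash) pairs; the function name is ours.

```python
def consistency_check(log_segment, authenticators):
    """Return the authenticators that match no entry in the log segment.

    log_segment:    iterable of (seq, chain_hash) pairs from the target's log
    authenticators: iterable of (seq, chain_hash) pairs collected by neighbors,
                    with signatures assumed to be verified beforehand
    """
    logged = {seq: h for seq, h in log_segment}
    # A missing sequence number or a different hash both count as a mismatch.
    return [(seq, h) for seq, h in authenticators if logged.get(seq) != h]
```

An empty result means the segment is consistent with everything the neighbors have seen; any returned authenticator, together with the signed segment, would serve as evidence of tampering.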


The conformance check detects if the target has de- 
viated from the BGP-A protocol. This is a purely syn- 
tactic check that does not consider which routes were 
announced, but rather how they were announced. For 
example, NetReview checks whether each message was 
well-formed and whether sessions were opened with the 
proper handshake before announcements were sent. 


If the target AS passes the consistency and confor- 
mance checks, the auditor is convinced that the log ac- 
curately reflects the target’s BGP traffic. The remaining 
check is designed to detect routing problems. 


3.7 Extracting the routing state 


The previous two checks are performed on logs from in- 
dividual border routers of an AS. However, many routing 
problems arise because of inconsistencies between mul- 
tiple routers. Therefore, the auditor must perform the 
compliance check based on the ‘global’ routing state of 
the AS, which it obtains by merging the logs from the 
individual routers. 


NetReview models the ‘global’ routing state of an AS 
as follows. At any given point in time, the AS has a set of 
peering points with neighboring ASes, and for each peer- 
ing point there are two routing information bases (RIBs): 
the outRIB contains routes that the AS has announced 
to its neighbor, and the inRIB contains routes that the 
neighbor has announced to the AS. Since BGP does not 
permit the announcement of multiple alternative routes, 
each RIB can contain at most one route for each prefix. 


To determine how the target’s routing state evolved 
over time, the auditor starts by loading the oldest check- 
point from each log, which contains a snapshot of the 
RIBs. Then it repeatedly picks the unprocessed message 
entry with the earliest timestamp across all logs, and it 
applies the updates in the message to the corresponding 
pair of RIBs. Thus, it obtains a sequence of routing states 
S(t;), where t; indicates the time of the message that 
triggered the change. Note that each S(t;) contains a 
pair of RIBs for each router or peering point; it is not a 
‘global’ RIB for the entire AS. 
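
The replay procedure can be sketched as a timestamp-ordered merge of the per-router logs. The record layout used here (tuples of timestamp, peering point, direction, prefix, and route, with `None` marking a withdrawal) is a hypothetical simplification, not NetReview's log format.

```python
import heapq
from copy import deepcopy


def replay(checkpoint, logs):
    """Yield (t_i, S(t_i)) for every logged message, in timestamp order.

    checkpoint: {peering_point: {"in": {prefix: route}, "out": {prefix: route}}}
    logs:       one list per border router, each sorted by timestamp, holding
                (timestamp, peering_point, direction, prefix, route) tuples;
                direction is "in" (received) or "out" (sent), and
                route is None for a withdrawal
    """
    state = deepcopy(checkpoint)  # leave the caller's snapshot untouched
    merged = heapq.merge(*logs, key=lambda entry: entry[0])
    for t, point, direction, prefix, route in merged:
        rib = state.setdefault(point, {"in": {}, "out": {}})[direction]
        if route is None:
            rib.pop(prefix, None)   # withdrawal removes the route
        else:
            rib[prefix] = route     # at most one route per prefix per RIB
        yield t, deepcopy(state)
```

Copying the full state after every message keeps the sketch simple; an implementation that audits large logs would more likely check rules incrementally.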


3.8 Compliance check 


The compliance check detects if the target has broken 
any of its rules. Conceptually, this is done by checking 
each rule against each of the routing states S(t;). Recall 
that even the complete set of rules does not necessarily 
amount to a full specification of the AS’s routing pol- 
icy; thus, checking rules is not equivalent to re-evaluating 
each routing decision the AS has made. 

We have developed a simple specification language 
that ASes can use to formulate rules. In this lan- 
guage, a rule is written as a predicate on an indi- 
vidual routing state S(t;). For example, the rule 


$\forall c \; \forall r \in \mathrm{outRIB}(18, c) :$ 
$(\mathrm{prefix}(r) \in P) \Rightarrow (123 \in \mathrm{communities}(r))$ 


stipulates that, when a route r belongs to a prefix from 
the set P and is announced to AS 18 over any peering 
point c, r must be tagged with the community 123. We 
give more details on the specification language and the 
rule checker in Section 4. 
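
For illustration, the example rule could be evaluated against one routing state as follows. The data layout (a dictionary keyed by neighbor ASN and peering point) and the exact-match treatment of "prefix(r) in P" are assumptions made for this sketch; the actual rule checker is described in Section 4.

```python
def tagged_with_community(out_ribs, neighbor_asn, prefixes, community):
    """Check: for all c and all r in outRIB(neighbor_asn, c),
    prefix(r) in P implies community in communities(r).

    out_ribs: {(neighbor_asn, peering_point): {prefix: {"communities": set}}}
    prefixes: the set P of prefixes the rule applies to
    """
    for (asn, _peering_point), rib in out_ribs.items():
        if asn != neighbor_asn:
            continue  # the rule says nothing about other neighbors
        for prefix, route in rib.items():
            if prefix in prefixes and community not in route["communities"]:
                return False  # a route in P was announced without the tag
    return True
```

Routes outside P and routes announced to other neighbors never cause a violation, which reflects the partial nature of the rule.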


3.9 Rule commitment and access control 


For the compliance check, the auditor must know which 
rules should hold for the target during the audited inter- 
val. Also, if a rule is violated, the auditor should obtain 
evidence that the rule existed at the time of the fault. 
The easiest way to accomplish both would be to sim- 
ply record the rules in the tamper-evident log. However, 
since the logs are visible to each of the target’s neigh- 
bors, this might reveal proprietary information about the 
target’s routing policies. 

Instead, we only require that ASes commit to their 
rules by logging a hash value $H(s_i, r_i)$ for each rule $r_i$. 
Here $s_i$ is a 128-bit salt, which makes it difficult for an inquisitive 
auditor to learn sensitive information by checking for 
well-known rules, or to run a dictionary attack. On the 
other hand, if an auditor knows $r_i$ and $s_i$ a priori (perhaps 
from a peering agreement it shares with that AS, or 
because the AS has revealed them earlier), it can easily 
check whether the corresponding hash value is present. 
If not, it can use the log as evidence and file a complaint 
against the AS for breaking the contract. 
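
The commitment scheme can be sketched in a few lines of Python. The text does not name the hash function H or the encoding of a rule, so SHA-256 over the salt followed by the UTF-8 rule text is an assumption here, as are all function names.

```python
import hashlib
import secrets


def commit(rule_text: str, salt: bytes) -> str:
    """Return H(s_i, r_i); SHA-256 is an assumed stand-in for H."""
    return hashlib.sha256(salt + rule_text.encode("utf-8")).hexdigest()


def make_commitment(rule_text: str):
    """Pick a fresh 128-bit salt and compute the value the AS would log."""
    salt = secrets.token_bytes(16)  # 16 bytes = 128 bits
    return salt, commit(rule_text, salt)


def auditor_check(logged_hashes, rule_text: str, salt: bytes) -> bool:
    """An auditor that knows (r_i, s_i) checks whether the AS committed to it."""
    return commit(rule_text, salt) in logged_hashes


salt, digest = make_commitment("export customer routes to peers")
log = {digest}
print(auditor_check(log, "export customer routes to peers", salt))
print(auditor_check(log, "some other rule", salt))
```

Without the salt, an auditor could hash a list of well-known rules and compare the results against the log; the random salt makes that guess infeasible.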

Why would an AS commit to any rules at all, and why 
would it reveal a rule to an auditor? For example, ASes 
can use NetReview to enforce provisions from their peer- 
ing contracts. The parties could agree to a set of rules and 
add them to their respective logs; they would then reveal 
these rules to each other, but not to anyone else. Or an 
AS could adopt a set of best-practice rules to highlight 
its good performance, and reveal these rules to everyone. 


4 Writing and checking rules 


NetReview includes a simple specification language that 
ASes can use to formulate rules. In this section, we de- 
scribe this language in more detail, and we explain how 
rules in this language are evaluated. 


4.1 Language design 


The language includes three features we believe to be key 
for BGP fault detection. First, the language is declara- 
tive and refers to a high-level property, rather than to a 
specific algorithm for choosing routes. This makes rules 
easier to write and debug than, say, router configura- 
tion files. Moreover, many properties can be specified 
as rule templates that only require a few AS-specific pa- 
rameters. A number of common templates are already 
included with NetReview. 

Second, rules are partial specifications of the expected 
behavior. The above example only describes what should 
happen to routes that are announced to AS 18 and whose 
prefix is in P, but it does not say anything about the other 
routes. Thus, an AS can reveal a rule without revealing 
its entire routing policy. Also, we can vary the strength 
and number of rules and thus control how restrictive the 
checking should be. 

Finally, rules are time-local, that is, they depend only 
on a small number of past and future states. This is pos- 
sible because interdomain routing is essentially memo- 
ryless: whether or not a route is exported depends solely 
on which routes are currently available; it is irrelevant 
whether a route was available earlier, or will become 
available later.³ This improves efficiency considerably, 
since NetReview only needs to remember a small num- 
ber of routing states at any given time. 


4.2 Specifying rules 


Each NetReview rule consists of a set of constants and a 
set of predicates in first-order logic. The predicates are 
written using boolean operators, existential and univer- 
sal quantifiers, and equality. They can use two functions 
called inRIB(i, j) and outRIB(i, j) to access the 
RIBs for a peering point j with a neighbor with AS number 
i. An optional third argument contains an interval 
operator. 


³A notable exception is age-based tie breaking. We handle this by 
including the age of each route in the RIBs. 


const setof(integer) asns = { 8, 9 }; 
forall cpref in affectedPrefixes, peer in asns { 
  forall p1 in peeringPoints(peer), p2 in peeringPoints(peer) { 
    (p1 != p2) => forall route in outRIB(peer, p1, intersect[now-5.0s,now]) { 
      (prefix(route) == cpref) => exists route2 in outRIB(peer, p2, union[now-5.0s,now]) { 
        (prefix(route2) == prefix(route)) and 
        (sizeof(as_path(route2)) == sizeof(as_path(route))) 
      } 
    } 
  } 
} 


Figure 2: Example rule in the syntax used by the NetReview rule checker. intersect[a,b] selects routes that 
were announced continuously between time a and time b, while union[a,b] selects routes that were announced at 
any point between time a and time b. now is the point in time for which the rule is evaluated. 


Additionally, NetReview’s rule checker has several 
built-in functions and operators for manipulating num- 
bers, routes, sets, and sequences. These include basic 
arithmetic operators, functions for accessing the individ- 
ual elements of a route, and set operators such as union, 
intersection, containment, and indexing. Figure 2 shows 
an example rule. This rule says that, when exporting 
a route to AS 8 or 9, the adopting AS must advertise 
AS_PATHs of the same length over all peering points 
with that AS. 
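
To make the structure of the rule in Figure 2 concrete, here is a rough Python equivalent. It is a simplified sketch, not NetReview code: it ignores the interval operators, assumes each outRIB holds at most one route per prefix, and uses a hypothetical dictionary layout.

```python
def consistent_path_length(out_ribs, peer, affected_prefixes):
    """Return (prefix, p1, p2) triples that violate the rule for one peer AS.

    out_ribs maps (peer AS, peering point) -> {prefix: AS path as a tuple}.
    Only prefixes in affected_prefixes are examined.
    """
    points = [pp for (asn, pp) in out_ribs if asn == peer]
    violations = []
    for p1 in points:
        for p2 in points:
            if p1 == p2:
                continue
            for prefix, path in out_ribs[(peer, p1)].items():
                if prefix not in affected_prefixes:
                    continue
                # Direct lookup by prefix instead of scanning all routes at p2.
                other = out_ribs[(peer, p2)].get(prefix)
                if other is None or len(other) != len(path):
                    violations.append((prefix, p1, p2))
    return violations


out_ribs = {
    (8, "p1"): {"10.0.0.0/8": (5, 2), "192.0.2.0/24": (5, 7)},
    (8, "p2"): {"10.0.0.0/8": (5, 3, 2), "192.0.2.0/24": (5, 9)},
}
print(consistent_path_length(out_ribs, 8, {"10.0.0.0/8", "192.0.2.0/24"}))
```

The returned triples correspond to the variable assignments for which the predicate does not hold.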


4.3 Interval operators 


Why do rules ever have to depend on future or past 
states? The reason is that, due to propagation delays and 
clock skew, RIBs from different routers may be slightly 
out of sync. Hence, there can be short intervals during 
which a route appears in an inRIB but in none of the out- 
RIBs, or vice versa. To an auditor, this might look like a 
transient rule violation.⁴ 

To avoid false positives in this case, we must introduce 
a bit of leeway. NetReview’s specification language con- 
tains two timing-related operators. Both operators take 
an interval I = [t − α, t + β] as an argument, where t is 
an instant in time and α and β specify how far the interval 
extends into the past and into the future, respectively. 
The union operator returns all routes that have been advertised 
at some point in I, and the intersection operator 
returns all routes that have been advertised continuously 
during I. This allows us to mask transient inconsistencies. 
For example, we might stipulate that a route may 
only be exported if a prefix of that route was available 
within two seconds of the current time, or that a route 
must be exported to some neighbor if it has been available 
for at least five seconds. We limit α and β to 60 seconds 
each; thus, the auditor must remember at most two 
minutes’ worth of past or future states. 
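
The two interval operators can be illustrated with a small Python sketch. The representation is an assumption made for this example: each route is mapped to a list of closed time spans (start, end) during which it was advertised, and the function names are hypothetical.

```python
def union_op(history, lo, hi):
    """Routes advertised at some point in [lo, hi]."""
    return {
        route
        for route, spans in history.items()
        if any(start <= hi and end >= lo for start, end in spans)
    }


def intersect_op(history, lo, hi):
    """Routes advertised continuously throughout [lo, hi]."""
    return {
        route
        for route, spans in history.items()
        if any(start <= lo and end >= hi for start, end in spans)
    }


# Times are in seconds; "A" is stable, "B" flaps briefly, "C" appears later.
history = {"A": [(0, 100)], "B": [(8, 9)], "C": [(20, 30)]}
now = 10
print(sorted(union_op(history, now - 5, now)))
print(sorted(intersect_op(history, now - 5, now)))
```

A rule that quantifies over the intersection ignores the short-lived route "B", while a rule that searches the union still finds it; together they give the leeway needed to mask transient inconsistencies.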

If a rule contains interval operators, it can miss actual 
transient faults that exist for less than α + β seconds. The 
interval needs to be no larger than the maximum propagation 
delay plus the maximum clock skew among the 
routers of an AS, so this is not a serious limitation. 


⁴The use of a distributed snapshot algorithm such as [6] could avoid 
this problem, but it would require changes to the ISPs’ internal route 
distribution mechanism. 


NSDI ’09: 6th USENIX Symposium on Networked Systems Design and Implementation 


4.4 Optimizations for checking rules 


Conceptually, an auditor must evaluate each predicate 
whenever a) the routing state of the target AS changes 
due to an incoming or outgoing BGP-A message, or b) 
the value of an interval operator changes. For example, 
if a rule contains two intervals I_1 = [t − 5, t + 3] and 
I_2 = [t − 2, t + 3], the auditor must also evaluate each 
predicate five and two seconds before and three seconds 
after each routing change. 


In practice, we can dramatically reduce the num- 
ber of predicate evaluations using two simple optimiza- 
tions. First, since rules typically consider each pre- 
fix individually, we can often restrict universal quanti- 
fiers to the set of prefixes that are actually affected by 
a routing change during the current evaluation. This 
set of prefixes is made available in a special vari- 
able called affectedPrefixes. Second, we can 
apply some simple query optimizations. For example, 
in the rule in Figure 2, NetReview combines the 
check for prefix(route) == cpref with the innermost 
forall quantifier, which reduces the quantifier to 
a simple projection. 


4.5 Discussion 


Even though our specification language is very simple, 
we have found that it is sufficient to describe many of 
the routing problems that have been reported in the lit- 
erature, including origin misconfigurations [24], incor- 
rect use of communities [24], incorrect extensions of im- 
ported routes [29], route deaggregation, redistribution at- 
tacks, and inconsistent path lengths [9]. We note that 
the particular details of the language are not critical to 
NetReview; NetReview just needs a way to specify and 
check constraints on the behavior of an AS. Our language 
could easily be extended or replaced without affecting 
the rest of NetReview. 


No origin misconfiguration 
  ∀a ∀p ∀r ∈ outRIB(a, p, ∩[t − 40, t]) : (|as_path(r)| = 1 ∧ prefix(r) ∈ ownPrefixes) ∨ 
  (∃a′ ∃p′ ∃r′ ∈ inRIB(a′, p′, ∪[t − 40, t + 5]) : prefix(r) = prefix(r′) ∧ startsWith(r, r′) ∧ (∀n ∈ r − r′ : n ∈ ownPrefixes)) 

Export customer routes 
  ∀a ∈ customers ∀p ∀r ∈ inRIB(a, p) : ((∀n ∈ as_path(r) : n = a) ⇒ ∀a′ ∈ (peers ∪ providers) ∀p′ ∃r′ ∈ 
  outRIB(a′, p′, ∪[t − 15, t + 15]) : prefix(r) = prefix(r′) ∧ endsWith(r, r′)) 

Honor no-advertise community 
  ∀a ∀p ∀r ∈ inRIB(a, p, ∩[t − 5, t]) : NO_ADVERTISE ∈ communities(r) ⇒ (¬∃a′ ∃p′ ∃r′ ∈ 
  outRIB(a′, p′) : prefix(r) = prefix(r′) ∧ getElement(as_path(r), 1) = a) 

Consistent path length 
  ∀a ∈ (customers ∪ peers) ∀p ∀p′ : (p = p′) ∨ (∀r ∈ outRIB(a, p, ∩[t − 5, t]) ∃r′ ∈ outRIB(a, p′) : 
  prefix(r) = prefix(r′) ∧ |as_path(r)| = |as_path(r′)|) 

Backup link 
  ∀a ∈ backups ∀a′ ∈ (customers ∪ peers) ∀p ∀r ∈ outRIB(a′, p) : (|as_path(r)| > 1 ∧ 
  getElement(as_path(r), 1) = a) ⇒ (¬∃a″ ∈ providers ∃p′ ∃r ∈ inRIB(a″, p′, ∩[t − 5, t])) 


Table 1: Rules we checked in our experiments. Each rule is explained in Section 5.3. The variables a, a′ are for AS 
numbers, p, p′ are for peering points, and r, r′ are for routes. inRIB(a, p) and outRIB(a, p) stand for the sets of routes 
imported and exported, respectively, to AS a over peering point p; they can be combined with an interval operator. 


Figure 3: AS topology in our experiments. AS 2 receives 
updates from an Internet BGP trace. 


5 Feasibility study 


In this section, our goal is to demonstrate that NetReview 
(and, more generally, the fault detection approach) is 
practical. Using a prototype implementation of Net- 
Review, we answer the following high-level questions: 


• Are NetReview’s rules expressive enough to describe 
common routing problems? 


• How much storage and bandwidth is needed to 
maintain the tamper-evident logs? 


• Is fault detection feasible at Internet scale? 


5.1 Methodology 


In NetReview, all communication related to a given AS 
occurs among the direct neighbors of that AS. Hence, a 
small-scale deployment is sufficient to estimate the overhead. 
However, getting even a small number of contiguous 
Internet ASes to deploy experimental software would 
be extremely difficult. Instead, we used software routers 
to emulate a small AS topology in the lab, but we ensured 
that the routing table sizes and the amount of BGP 
traffic closely approximated those of real Internet ASes. 


⁵The interval sizes we use are worst-case values for a mirroring 
monitor (mainly due to MRAI timers). Much smaller intervals would 
suffice if the monitor is attached via port replicators or BMP [31]. 

To achieve this, we injected an Internet BGP trace 
into one of the ASes, including a checkpoint of the ini- 
tial routing table. From there, the routes were propa- 
gated to the other ASes via BGP, creating BGP traffic on 
each link and populating the other routing tables. This 
mimicked the conditions that would have occurred if our 
model topology had been part of the global Internet, so 
we could get realistic estimates for many performance 
metrics, e.g., how quickly the logs grow and how much 
time is required for checking. We found that, since the 
first trace already contained a route to each available pre- 
fix, injecting additional traces would not have increased 
the routing table sizes. 


5.2 Experimental setup 


Our NetReview prototype implements the basic system 
we have described so far, plus the additional techniques 
described later in Section 6, which enable NetReview to 
operate without a CA, in a partial deployment and with 
existing routers. These techniques add some overhead to 
our results, so the overhead of the basic algorithm would 
be lower than what we report here. 

For our experiments, we set up a synthetic network of 
35 Zebra BGP daemons [12], which form a topology of 
10 ASes (Figure 3). Our network contains a mix of AS 
types, ranging from large tier-1 ASes to small stub ASes, 
as well as both customer/provider and peering relation- 
ships. This diversity allowed us to implement and check 
a variety of different routing policies. Note that AS 8 
and AS 5 have two separate peering points, which will 
become important later. 

For each AS, we configured a default routing policy 
that satisfies the Gao-Rexford conditions [11]. If a route 
is imported from a customer, it is exported to all neigh- 
bors; otherwise (if the route is from a peer or provider), 
it is exported only to customers. In some of our experi- 
ments, we vary this policy by injecting configuration er- 
rors or imposing additional constraints. Internally, each 
AS uses a full-mesh iBGP topology. We did not set up 
route reflectors because NetReview is oblivious to iBGP. 
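
The default export policy described above can be written as a short function. This Python sketch only restates the two cases from the text; the function name and the representation of neighbor relationships are assumptions, and it does not model the injected errors or additional constraints.

```python
def export_targets(learned_from, neighbors):
    """Return the set of neighbor ASes a route is exported to.

    learned_from is the relationship of the neighbor the route was imported
    from: 'customer', 'peer', or 'provider'.
    neighbors maps AS number -> relationship of that neighbor.
    """
    if learned_from == "customer":
        # Customer routes are exported to all neighbors.
        return set(neighbors)
    # Routes from peers or providers are exported only to customers.
    return {asn for asn, rel in neighbors.items() if rel == "customer"}


neighbors = {1: "provider", 3: "peer", 6: "customer", 7: "customer"}
print(sorted(export_targets("customer", neighbors)))
print(sorted(export_targets("peer", neighbors)))
```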

We injected routing updates from a Route Views BGP 
trace [30] into AS 2. We used a 15-minute trace that 
was collected by a Zebra router at Equinix in Ashburn, 
VA, on January 27, 2008. The collecting router peers 
with nine other ASes. The trace contains 15,141 updates 
from these neighbors, and the corresponding RIB snap- 
shot contains 243,198 unique prefixes. Thus, AS 2 be- 
haved as if it were connected to the Internet in Ashburn, 
VA, and it exported a realistic set of prefixes to the other 
ASes. 

NetReview’s overhead depends in part on the number 
of neighbors an AS has. Unless otherwise noted, the 
numbers we report are for AS 5. Since 92% of Internet 
ASes have degree five or less [3], our results are repre- 
sentative of all but the largest Internet ISPs. 


5.3 Rules we checked 


In our experiments, we used NetReview to enforce five 
rules, which are shown in Table 1. In plain English, these 
rules state the following: 


• No origin misconfiguration: An AS may only export 
a route if it owns the corresponding IP prefix, 
or if the exported route is an extension of another 
route that the AS is currently importing (motivated 
by [24]). 


• Export customer routes: If an AS imports a direct 
route from one of its customers, it must export that 
route to its peers and providers. 


• Honor no-advertise community: An AS must 
honor the NO_ADVERTISE community; it may not 
re-export a route that is tagged with this community. 


• Consistent path length: When exporting a route 
to a customer or a peer, an AS must advertise 
AS_PATHs of the same length at all peering points 
(motivated by [9]). 


• Backup link: An AS may only export a route via a 
backup path if its direct links become unavailable. 


We chose these five rules because they can be used to 
detect real problems that have been reported in the Internet 
[9, 24, 29], and because they demonstrate the different 
types of conditions NetReview can verify (of course, 
each rule could be varied and customized in a number of 
ways). Note that the first two rules are very powerful; 
together, they can find almost all of the routing problems 
that were studied in [24]. In particular, the first rule cov- 
ers AS_PATH manipulations, which are the main focus 
of secure routing systems like S-BGP (it actually goes 
beyond S-BGP in that it can also check for timely route 
withdrawal). The last three rules catch routing problems 
that would be difficult to find without a detection system, 
since they can only be detected by combining informa- 
tion from several routers and/or ASes. 


5.4 Functionality check 


We begin with a simple functionality check to show that 
the prototype is fully functional and works as expected. 
Recall that NetReview’s design precludes false positives 
and false negatives if each AS is audited regularly. 

We ran a series of six trials. In the first trial, we 
used the correct configuration for each AS. In the fol- 
lowing five trials, we made a configuration change to a 
NetReview-enabled AS at some point during the exper- 
iment that caused one of the five rules to be violated. 
After each trial, we audited all the logs. 

As expected, NetReview did not report any problems 
during the first trial. In each of the other trials, it re- 
ported the fault we had injected. The output also in- 
cluded the time interval in which the fault appeared, as 
well as the variable assignments (prefixes, AS numbers 
etc.) for which the corresponding rule did not hold. This 
is valuable for administrators because it shows not only 
where the fault occurred (in the audited AS) but also for 
which prefix the exported paths did not have the same 
length, which peering points were affected, etc. 


5.5 Processing power 


BGP-A speakers and monitors must generate and ver- 
ify cryptographic signatures. The necessary processing 
time is a function of the number of messages they send 
and receive. In our experiment, the monitor in AS 5 
sent 1,973 BGP-A messages and received 1,579 during 
the 15-minute period. Since all messages are acknowledged, 
this required 3,552 signatures to be generated and 
an equal number to be validated, or on average about four 
signatures and validations per second. On a 3 GHz Pentium 4, 
a 1024-bit RSA signature can be generated and verified 
in less than 3.5 ms. 
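
The arithmetic behind these figures is easy to reproduce. In the sketch below, the message counts and the 3.5 ms bound are taken from the text; treating 3.5 ms as the cost of one signature generation plus one verification, and deriving a CPU share from it, is our own reading and not a measured result.

```python
SENT = 1973                      # BGP-A messages sent by the monitor in AS 5
RECEIVED = 1579                  # BGP-A messages received
TRACE_SECONDS = 15 * 60          # length of the trace
SIGN_AND_VERIFY_SECONDS = 0.0035 # assumed upper bound for one sign + one verify

# Every message is acknowledged, so each message in either direction costs
# the monitor one signature generation and one validation.
signatures = SENT + RECEIVED
rate_per_second = signatures / TRACE_SECONDS
cpu_fraction = rate_per_second * SIGN_AND_VERIFY_SECONDS

print(signatures)
print(round(rate_per_second, 2))
print(round(100 * cpu_fraction, 2))  # percent of one CPU, under the assumption above
```

Under this reading, the cryptographic work occupies well under 2% of a single CPU of that generation.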

Unlike BGP messages, BGP-A messages can contain 
updates for multiple different routes, which explains why 
the number of messages is much lower than the number 
of routing changes in our BGP trace. This also limits the 
number of validations that are required when updates arrive 
in bursts. For example, if a router is restarted and receives 
full routing tables from its neighbors, it only needs 
to check one signature per routing table. This is in contrast 
to S-BGP [22], which needs to check a signature for 
every single route. 


Figure 4: Average processing time required to check a 
rule over one second of log data (the error bars show the 
5th and the 95th percentile). The speed is sufficient for 
checking multiple ASes in real time. 


Auditors need processing power to extract the routing 
state from the logs, and to check it against the specified 
rules. In our experiments, we found that the processing 
time was dominated by rule checking, which in turn de- 
pends on the number of routing changes as well as the 
complexity of the rules. Our five rules can be evaluated 
independently for each prefix, so the first optimization 
from Section 4.4 can be used. It would take more time 
to check rules that depend on a large number of differ- 
ent prefixes, but we are not aware of any useful rules that 
have this property. 


Figure 4 shows the average time required to check a 
one-second log segment against each of our five rules.⁶ 
Our 15-minute log required 11,629 such checks, which 
took 41.5 seconds on a Pentium-4 workstation. 


In practice, the checking time would also depend on 
the number and complexity of the rules the target AS is 
revealing to the auditor. There is little published infor- 
mation about the policies used by commercial ASes, so 
we cannot say how large a ‘typical’ set of rules would be. 
We already included a generic policy rule (rule #2) in our 
set, which may be sufficient for small ASes. Even if we 
assume that a typical set contains 20 rules (four times 
the size of our set), an AS with five neighbors would still 
only need a single workstation to perform real-time au- 
diting. If an AS has more neighbors, it can spread the 
load across multiple machines, since rule checking can 
be trivially parallelized. 


⁶The processing time varies considerably because some one-second 
intervals contain many updates, while others contain none at all. 


5.6 Storage space 


BGP-A speakers require storage for checkpoints, the 
tamper-evident log, and for the certificates that bind each 
key to the identity of an AS. An X.509 certificate with 
1024-bit RSA keys is about 1kB. With web-of-trust sig- 
nature chains (described in Section 6.1) and a typical 
AS-path length of four, each certificate is 5kB; thus, a 
database with certificates for 40,000 ASes would require 
approximately 195 MB. 

The size of a checkpoint is dominated by the RIBs; it 
depends on the number of prefixes and peering points. 
One RIB with 244,000 prefixes and a 90-second history 
takes about 9.0 MB, so, if we conservatively assume that 
each prefix appears in every inRIB and every outRIB, a 
complete checkpoint for an AS with six peering points 
could take up to 108 MB. If the AS records one check- 
point every minute and keeps all checkpoints for one day, 
plus one checkpoint for each day of the last year, it would 
require up to 190 GB. 

In our experiment, the log grew at a rate of about 
332 kB per minute (without checkpoints). Hence, we es- 
timate that one year’s worth of log data would take about 
166 GB. The log size is also a function of the number 
of peering points and the frequency of routing changes. 
Since the log mostly contains routing updates, its growth 
rate is roughly proportional to the amount of BGP traf- 
fic an AS generates. Recall that the numbers we report 
are for an AS with five neighbors; if an AS has more 
neighbors (and thus more peering points), its storage re- 
quirements are higher. For the largest ASes (UUNet has 
2,652 neighbors), on the order of a hundred Terabytes 
of storage may be necessary to store the log for a year. 
However, the log would be distributed over thousands of 
routers. 
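
The storage estimates in this section follow from a few multiplications, reproduced here as a Python sketch. All inputs are the figures quoted above; binary units (1 MB = 1024 kB, 1 GB = 1024 MB) are an assumption that happens to match the rounded totals.

```python
KB_PER_MB = 1024
MB_PER_GB = 1024

# Certificate database: 40,000 ASes at 5 kB per certificate.
cert_db_mb = 40_000 * 5 / KB_PER_MB

# Checkpoint: one inRIB and one outRIB (9.0 MB each) per peering point.
peering_points = 6
checkpoint_mb = 9.0 * 2 * peering_points

# One checkpoint per minute for a day, plus one per day for a year.
stored_checkpoints = 24 * 60 + 365
checkpoints_gb = stored_checkpoints * checkpoint_mb / MB_PER_GB

# Log growth of 332 kB per minute, extrapolated to one year.
log_gb_per_year = 332 * 60 * 24 * 365 / KB_PER_MB / MB_PER_GB

print(round(cert_db_mb), round(checkpoint_mb), round(checkpoints_gb), round(log_gb_per_year))
```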

Auditors require no permanent storage; however, it 
makes sense for them to cache a recent checkpoint for 
each AS they are auditing, so they do not have to down- 
load one repeatedly. 


5.7 Message overhead 


BGP-A speakers generate traffic for maintaining BGP-A 
sessions, for exchanging authenticators and for respond- 
ing to audits. We look at each type of traffic in turn. 

In terms of traffic, BGP sessions and BGP-A sessions 
are quite similar. If 1024-bit keys are used, a BGP-A 
message and its acknowledgment have 367 header bytes, 
while a BGP message only has 16. On the other 
hand, a BGP-A message can advertise many different 
routes, while a BGP message can only advertise one. In 
our experiment, AS 5 generated an average of 132 kB 
of BGP-A messages and acknowledgments per minute; 
these were equivalent to 135 kB of BGP messages. 

Upon receiving a message or an acknowledgment, a 
BGP-A speaker detaches the authenticator and forwards 
it to the sender’s neighbors. With 1024-bit keys, the 
size of an authenticator is 156 bytes; in our experiment, 
AS 5’s neighbors sent 2.1 MB of AS 5’s authenticators 
over the 15-minute period. However, authenticators are 
also collected from messages read during an audit, so 
the required traffic is quadratic in the number of neighbors: 
each neighbor audits each message and sends the 
corresponding authenticator to each of the other neighbors. 
This can be a problem for large ASes (e.g., UUNet). 
Therefore, authenticators from large ASes should be sent 
to only a subset of their neighbors. This does not affect 
NetReview’s guarantees as long as the subsets used by 
all neighbors intersect in at least one diligent neighbor. 

In our experiment, all audits were incremental; the au- 
ditor transferred a full checkpoint once and then retrieved 
only the log entries that were added since the last audit. 
In the limit, the required traffic is the size of the log times 
the number of auditors, plus some overhead for headers. 

In total, AS 5 caused about 420 kbps of BGP-A traf- 
fic, including routing updates, auditing, and authentica- 
tors sent by the neighbors. This corresponds to the band- 
width of a typical DSL upstream, which is insignificant 
compared to the amount of traffic ISPs routinely handle. 


5.8 Summary 


Our experiments show that NetReview’s simple rules 
are sufficient to describe common, nontrivial routing 
problems. Also, NetReview’s resource requirements are 
moderate: in a typically-sized AS with five neighbors, 
routers must sign less than four messages per second on 
average, a single hard disk is sufficient to keep one year’s 
worth of log data, and the total traffic is less than the ca- 
pacity of a single broadband upstream link. Finally, we 
have demonstrated that fault detection is feasible at In- 
ternet update rates. By running the NetReview software 
on just a single workstation, an ISP can audit dozens of 
neighboring ASes in real time. 


6 Practical challenges 


In the previous two sections, we have shown that it is fea- 
sible to build a fault detection system with strong guar- 
antees, and that its resource requirements are moderate. 
The goal of this section is to show how NetReview deals 
with the various practical problems that have hampered 
the deployment of previous solutions. In particular, we 
will show that NetReview can operate without a CA, that 
it can be effective in a partial deployment, that it can 
initially be deployed without upgrading any routers, and 
that it offers incentives for incremental deployment. 


6.1 NetReview without a CA 


Despite many proposals, deploying a global CA for pre- 
fixes and ASes has so far not found acceptance [19]. Net- 
Review can use such a CA if it exists, but it does not re- 
quire it. In the absence of a CA, we need to find replace- 
ments for two services that a CA provides: associating 
each key pair with a real-world identity, and certifying 
ownership of AS numbers and IP prefixes. 

We solve this problem using a web-of-trust approach 
that is inspired by [33, 34]. Each AS initially generates 
a key pair and creates a self-signed certificate. Then it 
sends the certificate to its immediate neighbors, who ap- 
pend their own endorsement and forward it on to their 
neighbors, etc. The overhead for flooding certificates is 
not a concern, because the AS topology changes slowly. 

Each AS obtains a database of all certificates, each 
with a chain of endorsements that corresponds to the 
shortest path between the local AS and the AS repre- 
sented by a given certificate. Can these certificates be 
trusted? We can safely assume that each AS knows the 
true identity of the neighbor attached to each of its phys- 
ical links. Moreover, we have assumed earlier that each 
AS has a diligent neighbor. This neighbor can detect if 
the AS signs a certificate that does not correspond to its true 
identity, or endorses a certificate that does not come from 
one of its neighbors. Thus, a node can (transitively) trust 
every certificate that is endorsed by one of its neighbors. 
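
One way to picture the endorsement chains is as shortest paths in the AS graph. The Python sketch below computes such a path with a breadth-first search; it is purely illustrative, since the text does not describe how NetReview selects or stores the chains, and the adjacency-list input and function name are hypothetical.

```python
from collections import deque


def endorsement_chain(adjacency, local_as, subject_as):
    """Return a shortest AS-level path from subject_as to local_as, or None.

    adjacency maps an AS number to the list of its neighbor AS numbers.
    Each AS on the path after the subject would add one endorsement.
    """
    parent = {subject_as: None}
    queue = deque([subject_as])
    while queue:
        node = queue.popleft()
        if node == local_as:
            chain = []
            while node is not None:  # walk back to the subject
                chain.append(node)
                node = parent[node]
            return chain[::-1]
        for neighbor in adjacency.get(node, ()):
            if neighbor not in parent:
                parent[neighbor] = node
                queue.append(neighbor)
    return None


adjacency = {1: [2], 2: [1, 3, 5], 3: [2, 4], 4: [3, 5], 5: [2, 4]}
print(endorsement_chain(adjacency, local_as=4, subject_as=1))
```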

In addition, we require each AS to log a public pledge 
that specifies its current ASN and prefix ownerships.⁷ 
ASes extract this pledge during audits and compare it to 
their database; if there is any change, they flood it to all 
other ASes. Thus, NetReview can detect if two ASes 
claim ownership of the same ASN or of overlapping pre- 
fixes, and it provides each with evidence of the other’s 
claim. The conflict can then be resolved through existing 
mechanisms, e.g., by a mediator or a judge. 
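
Detecting overlapping prefix claims amounts to a pairwise comparison of the pledged prefixes. The following Python sketch uses the standard ipaddress module; the pledge format and function name are assumptions, and the special handling that anycast prefixes require is not modeled.

```python
import ipaddress
from itertools import combinations


def find_conflicts(pledges):
    """Return overlapping prefix claims made by different ASes.

    pledges maps AS number -> iterable of prefix strings from its pledge.
    """
    claims = [
        (asn, ipaddress.ip_network(prefix))
        for asn, prefixes in pledges.items()
        for prefix in prefixes
    ]
    conflicts = []
    for (as1, net1), (as2, net2) in combinations(claims, 2):
        if as1 == as2 or net1.version != net2.version:
            continue  # same owner, or IPv4 versus IPv6
        if net1.overlaps(net2):
            conflicts.append((as1, str(net1), as2, str(net2)))
    return conflicts


pledges = {
    64500: ["10.0.0.0/8"],
    64501: ["10.1.0.0/16", "192.0.2.0/24"],
    64502: ["2001:db8::/32"],
}
print(find_conflicts(pledges))
```

Each reported tuple names both claimants, which mirrors the idea that each AS receives evidence of the other's claim.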


6.2 Partial deployment 


It would be unrealistic to expect that all ASes adopt Net- 
Review, much less that all ASes install the system at the 
same time. Therefore, NetReview must be able to work 
in a partial deployment, that is, it must be able to interact 
with non-participants via BGP. 

By default, BGP-A speakers and monitors record only 
BGP-A messages in their logs, and auditors use only 
BGP-A messages to reconstruct the routing information. 
However, legacy neighbors have no components that 
speak BGP-A. If we simply omitted all routes imported 
from or exported to these neighbors, the information in 
the log might not be sufficient to evaluate many interesting 
conditions. For example, if an AS acts as a provider 
for another AS, it may be required to export routes for all 
prefixes it knows about, even if the corresponding route 
is through a non-participant. Therefore, if an AS has 
legacy neighbors, its BGP-A speakers and monitors additionally 
record all the (unsigned) BGP messages they 
exchange with these neighbors. 


⁷Prefixes used for IP anycast [26] require special handling because 
they may be owned by multiple ASes simultaneously. 

Why keep this information in the secure record if 
a faulty participant AS can simply record whatever it 
wants? There are three reasons. First, we can isolate 
non-malicious faults such as misconfigurations or hard- 
ware failures, where the faulty AS still records correct 
information. Second, even if an AS lies about the routes 
it is importing or exporting via BGP, it must lie consis- 
tently to avoid detection by the auditors. For example, if 
the AS claims to have imported a certain route via BGP, 
it must re-export that route to each participating neighbor 
if required by its peering agreement, and it cannot export 
different versions to different neighbors. 

Third, logging BGP messages enables an intermediate 
level of participation in NetReview. If a non-participant 
AS a is a neighbor of a participant AS b, a can act as 
an auditor and compare b’s log to the BGP messages b 
actually sent, without fully deploying fault detection itself. 
All a needs is the NetReview auditor software and 
a current snapshot of its own BGP tables. If a finds a 
discrepancy, it can investigate it by contacting the participant 
AS b. This option could encourage neighbors of 
participant ASes to ‘try out’ fault detection. 

Partial deployment requires an addition to the web-of- 
trust technique in Section 6.1. As long as the deployment 
is contiguous in the AS graph (which is likely if tier-1 
ASes join first), the technique works as described. When 
a second ‘island’ of participants arises, at least one mem- 
ber of each island must exchange cryptographic creden- 
tials out-of-band. These members are then considered 
NetReview neighbors (even though they do not share a 
physical link), and they forward certificates from their 
respective islands to ensure that each AS has a full set. 
To increase the chance that ASes in small islands have 
a diligent neighbor, they also collect authenticators for 
each other and periodically audit each other’s log. 


6.3 Using existing routers 


Requiring ISPs to upgrade or replace their routers to 
deploy NetReview would present a significant hurdle. 
Therefore, it is useful to have an intermediate solution 
that works with existing, unmodified routers. Our solu- 
tion is to run the NetReview software on ordinary work- 
stations, which we call monitors. The monitors speak 
both BGP and BGP-A; they observe all BGP traffic in- 
cident to the AS’s existing routers and maintain BGP-A 
sessions with any monitors (or native BGP-A speakers 
where available) in adjacent ASes. The monitors also 
maintain tamper-evident logs and perform all crypto- 
graphic operations. Thus, the existing routers need not 
be modified. 

There are two ways to configure a monitor [29]. A 
proxying monitor interposes on all BGP connections of 
its local AS. When it receives a BGP message from a 
local border router, it sends an equivalent BGP-A mes- 
sage to the remote BGP-A speaker (or monitor) and 
vice versa. A mirroring monitor snoops on the exist- 
ing control connections, e.g., using a port replicator, the 
BGP monitoring protocol [31], or additional BGP ses- 
sions. Whenever it sees an outgoing message on the 
legacy BGP connection, it sends a BGP-A message with 
the same information over a separate connection to the 
neighbor’s BGP-A speaker or monitor. 

Mirroring monitors are safer because the routers do 
not depend on them. If a monitor fails, the routers can 
still send or receive routing updates via BGP and nor- 
mal operation is not affected. On the other hand, mirror- 
ing monitors allow inconsistencies between the updates 
sent via BGP and BGP-A. Consider a case where a mis- 
configured or faulty router advertises some route A to its 
monitor and a different route B to the adjacent AS. The 
monitor would record route A in the tamper-evident log, 
and the AS could not be held accountable for route B. 

To address this case, mirroring monitors maintain a 
third RIB for each peering point, which we will call 
inRIB-BGP. The inRIB contains the routes advertised 
via BGP-A as before, while the inRIB-BGP contains the 
routes received over the monitor’s BGP sessions. Nor- 
mally the two are identical; the scenario described ear- 
lier would manifest itself as an inconsistency between 
inRIB and inRIB-BGP in two adjacent ASes. Thus, an 
inconsistency cannot go undetected; however, an audi- 
tor cannot decide whether an inconsistency between in- 
RIB and inRIB-BGP is caused by the audited AS or by 
its neighbor, and therefore must suspect both. Because 
BGP neighbors have a business relationship, they can be 
expected to swiftly sort out a demonstrated inconsistency 
between their advertised routes. 
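
The comparison between inRIB and inRIB-BGP can be illustrated with a short sketch. This is not NetReview's implementation: the representation of a RIB as a dictionary from prefix to AS path, the function name `rib_inconsistencies`, and the sample routes are all assumptions made for this example.

```python
def rib_inconsistencies(in_rib, in_rib_bgp):
    """Return the prefixes whose routes differ between the two RIBs.

    Both arguments map a prefix string to an AS path (a tuple of AS
    numbers). A prefix missing from one RIB is reported with None.
    """
    diffs = {}
    for prefix in sorted(set(in_rib) | set(in_rib_bgp)):
        via_bgp_a = in_rib.get(prefix)      # route advertised via BGP-A
        via_bgp = in_rib_bgp.get(prefix)    # route seen on the legacy BGP session
        if via_bgp_a != via_bgp:
            diffs[prefix] = (via_bgp_a, via_bgp)
    return diffs


# Hypothetical data: the router told its monitor one route for 10.0.0.0/8
# but sent a different one to the adjacent AS over legacy BGP.
in_rib = {"10.0.0.0/8": (64500, 64501), "192.0.2.0/24": (64500,)}
in_rib_bgp = {"10.0.0.0/8": (64500, 64502), "192.0.2.0/24": (64500,)}
print(rib_inconsistencies(in_rib, in_rib_bgp))
```

The sketch only reports that the two RIBs disagree. As noted above, the disagreement alone does not tell an auditor which of the two neighbors caused it.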


6.4 Incentives for deployment 


If fault detection is to be deployed incrementally in the 
current Internet, we need good arguments to persuade 
ISPs to adopt it. Here, we present two arguments we 
believe to be compelling: ISPs can use fault detection as 
a distinguishing feature to attract more customers, and 
they can use it for root-cause analysis in the entire Inter- 
net, even in non-participating ASes. 

Market forces: The first adopters of NetReview are 
likely to be large ISPs, such as tier-1 and tier-2 ASes, 
who tend to adopt new routing technology and best prac-
tices early. As a result, their routing performance is of-
ten excellent. These ASes can demonstrate their excel- 
lent performance by offering fault detection as a value- 
added service to their customers and thus distinguish 
themselves from the competition. 


Once fault detection is on the market, competitors are 
encouraged to measure up by offering the service them- 
selves. Thus, small islands of participants emerge. At 
this point, when a fault is caused by a non-participant, 
the participants can handle any complaints by proving 
that they are not the cause, and by tracing the problem to 
a non-participant just outside the island’s perimeter, who 
must then handle the complaint. This creates an incentive 
for ASes to be inside the perimeter, and thus causes the 
islands to expand and the gaps between them to shrink. 


Note that this approach works for NetReview because, 
unlike secure routing protocols like S-BGP, it is effective 
even in a small deployment of just a few ASes. 

Root-cause analysis: As an additional benefit, partic- 
ipant ASes can use the fault detection system to diagnose 
faults even if the cause is in a non-participating AS. Since 
non-participants do not sign messages, do not maintain 
tamper-evident logs, and do not reveal any rules, we can- 
not guarantee that the diagnosis will always be accurate, 
and we cannot detect certain types of faults, such as pol- 
icy violations. However, even an approximate diagnosis 
enables the AS to respond more effectively to faults. 

Since non-participants do not have tamper-evident 
logs, we cannot directly apply auditing to find faults. 
Instead, we can use the participants’ logs as a giant 
BGP looking glass that provides information about BGP 
updates from many vantage points. There are several 
proposed systems that can use this data to diagnose 
faults [10, 13, 21, 32]. In fact, because NetReview 
records a history of past states, it provides even more in- 
formation than existing systems need; this could be used 
to develop even more powerful systems. 


6.5 Accuracy in a partial deployment 


When NetReview is used for root-cause analysis in a 
partial deployment, it returns a candidate set — a set of 
ASes that could have caused the fault. The size of this 
set depends on the size of the NetReview deployment. 
To estimate this dependency, we ran a simulation based 
on CAIDA’s Internet AS topology [3], assuming a sce- 
nario in which an AS suspects that a given route has been 
spoofed. With NetReview in place, we can audit the par- 
ticipant ASes on the path and thus localize the fault to ei-
ther a) a participant AS, b) a segment of non-participants
between two participants, or c) the path suffix after the 
last participant. Here, we will ignore the possibility that 
the participant ASes record incorrect information.

Figure 5: Fault localization under partial deployment.
Shown is the average number of ASes that could have 
caused an observed failure. 


For each deployment size, we simulated 10,000 trials 
as follows: we randomly picked an AS and calculated 
the shortest path to a random other AS using the Gao- 
Rexford conditions, then we picked a random AS on that 
path as the faulty AS and measured the number of candi- 
dates. We report averages over all 10,000 trials. 
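
The candidate-set computation for a single trial can be sketched as follows. This is an illustration under stated assumptions, not the simulator used for Figure 5: the function name `candidate_set` and the toy path are hypothetical, path selection under the Gao-Rexford conditions is omitted, and a run of non-participants at the start of the path is treated the same way as an interior segment.

```python
def candidate_set(path, participants, faulty):
    """Return the ASes on `path` that cannot be ruled out as the cause.

    path         -- list of ASes along the route
    participants -- set of ASes that run the fault detection system
    faulty       -- the AS on the path that actually caused the fault
    """
    i = path.index(faulty)
    if faulty in participants:
        return [faulty]                      # case a): audit pinpoints the AS
    lo = i
    while lo > 0 and path[lo - 1] not in participants:
        lo -= 1                              # extend left to the nearest participant
    hi = i
    while hi + 1 < len(path) and path[hi + 1] not in participants:
        hi += 1                              # extend right to the next participant or path end
    return path[lo:hi + 1]                   # cases b) and c)


path = ["A", "B", "C", "D", "E"]
print(candidate_set(path, {"A", "D"}, "C"))  # segment between participants A and D
print(candidate_set(path, {"A", "D"}, "E"))  # suffix after the last participant
print(candidate_set(path, set(), "C"))       # no deployment: the whole path
```

With no participants, the candidate set is the entire path, which matches the observation that the average starts at the average AS path length.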


Figure 5 shows our results for two different deploy- 
ment assumptions: either ASes deploy NetReview in ran- 
dom order, or in order of decreasing node degree. The 
first assumption is rather conservative; in practice, large 
ISPs typically run the latest routers, are among the first 
to apply best common practices and pride themselves on 
their good performance, so they are more likely to be 
early adopters. 


In both cases, the average number of candidates starts 
at four (the average AS path length on the Internet). 
However, if the ASes with the most neighbors de- 
ploy NetReview first, the average decreases much more 
rapidly and reaches perfect localization with only a 15% 
deployment. The reason is that there are only about 12- 
15 tier-1 ASes; once these have deployed NetReview, 
faults can already be localized to one half of the path. 
85% of the ASes in the Internet do not have customers of 
their own; once the other 15% participate, faults can be 
localized to one of them or to one of their customers. 

This result shows that early adopters of a fault de- 
tection system like NetReview can derive considerable 
benefits from it; a deployment that includes the 0.1% 
highest-degree ASes would already be able to double the 
accuracy of its diagnoses. In contrast, fault prevention 
systems like S-BGP are only effective when they are al- 
ready widely deployed. 


7 Future work 


In this section, we describe some advanced features that 
could be added to NetReview.


7.1 Simultaneously inspecting several ASes


NetReview inspects one log at a time, which is sufficient 
to detect protocol violations and policy violations. How- 
ever, NetReview cannot detect problematic interactions 
between the policies of multiple ASes that way. An ex- 
ample is bad gadgets [14], which only arise when the 
routing policies of several ASes conflict in a circular 
fashion. To detect bad gadgets, NetReview would have 
to inspect the logs of multiple ASes simultaneously. 

Technically, it is not difficult to fetch the logs from 
multiple ASes and to evaluate rules over multiple RIBs. 
However, routing policies are typically pair-wise confi- 
dential; thus, the check would have to be performed by a 
mutually trusted auditor. An alternative method to detect 
such policy conflicts, proposed in [15], is to have ASes 
annotate BGP advertisements with a history in a manner 
that preserves the privacy of the routing policies. Be- 
cause NetReview records and publishes histories of BGP 
advertisements as part of its regular operation, this tech- 
nique can be readily applied. 


7.2 Detecting data-plane inconsistencies 


In this paper, we have focused on providing fault de- 
tection for the control plane — the BGP announcements 
ASes send to each other. However, an AS could conceiv- 
ably advertise one path in BGP and forward data pack- 
ets on another, whether inadvertently or as part of an at- 
tack. NetReview already provides two mechanisms that 
can detect inconsistencies between the control and data 
planes: (i) it offers authoritative information about the 
route advertisement in the control plane, and (ii) it estab-
lishes the secure log that could also record observations
about the data plane. 

For example, suppose AS B advertises route “B C” 
to AS A but instead forwards A’s traffic to AS D. If D 
passively monitors the traffic received from B, D can ob- 
serve that A’s packets are misrouted. D can add this ob- 
servation to its log, and any auditors can thus obtain evi- 
dence of a data-plane inconsistency between B and D. 


7.3 Internal audits 


NetReview provides fault detection for BGP inter- 
domain routing. It does not record any intra-domain rout- 
ing messages in the tamper-evident log because these 
could reveal confidential information, such as the AS’s 
internal topology. 

However, NetReview could easily be adapted to cover 
intra-domain routing using a separate, private record. 
ASes could then perform internal audits to discover mis- 
configurations or compromised routers in their internal 
network, even when these routers have not (yet) caused a 
routing problem that would be visible to a neighbor. 


8 Related work


Detection: Anomaly detection techniques [7, 18, 21, 23, 
36] use the BGP routing updates from one or more van- 
tage points to build a de facto registry of the AS topol- 
ogy and prefix ownership. They raise an alarm upon 
receiving updates that disagree with the registry. Root- 
cause analysis (RCA) algorithms analyze BGP update 
messages from multiple vantage points to identify the 
AS(es) responsible for a routing change [4, 5, 10]. In 
RCA, each vantage point identifies a set of suspect ASes, 
then the sets are correlated to determine the potential cul- 
prit(s). The accuracy of RCA depends on the number 
and location of the vantage points. Unlike both RCA and 
anomaly detection, NetReview produces no false posi- 
tives or false negatives, and it is not vulnerable to com- 
promised ASes. In addition, NetReview can detect a 
larger class of faults, and it produces evidence that can 
be used to convince a third party. 

AudIt [2] can determine which ASes are losing or de-
laying packets on the data plane. However, AudIt can
only reveal the symptoms of a malfunctioning control 
plane, whereas control-plane fault detection can perform 
diagnosis. 

Prevention: Secure routing protocols [20, 22, 33, 34] 
can ensure that (i) a route advertisement originates from
the legitimate origin AS and that (ii) the AS-path of a
route advertisement has not been modified or forged. On 
the one hand, secure routing protocols can prevent cer- 
tain types of faults, whereas NetReview can only detect 
them; on the other hand, NetReview covers a larger class 
of faults, including policy violations (such as a faulty AS 
redistributing routes from one upstream provider to an- 
other), it can localize faults, and it provides incentives 
to avoid them. Perhaps more importantly, secure routing 
protocols do not provide appreciable benefits until many 
(if not all) ASes have adopted them, which explains in 
part why they have not yet been deployed, whereas Net-
Review is effective even in small deployments.

N-BGP [29] uses trusted hardware to enforce a BGP 
safety specification for individual routers. Unlike N- 
BGP, NetReview does not require trusted hardware and it 
produces evidence of faults that can be verified by third 
parties. Moreover, NetReview is designed to check an 
entire AS’s operation, not only against a safety specifica- 
tion but also against the AS’s routing policy as specified 
in its peering agreements. 

AIP [1] is a clean-slate redesign of IP that, among 
other things, would greatly simplify the deployment of 
a secure routing protocol. However, even if AIP were to 
replace IP entirely, it would be subject to the limitations 
of secure routing protocols described above. 

Accountability: NetReview’s tamper-evident log is 
based on the log in PeerReview [17], a general account-
ability framework for distributed systems. However,
NetReview goes beyond PeerReview, which is based on 
assumptions that do not hold in interdomain routing. For 
example, PeerReview requires a certificate authority, it 
cannot operate in a partial deployment, it cannot protect 
the business secrets of ISPs, and it detects neither pol- 
icy violations nor any other condition that involves more 
than one router. 


9 Conclusion 


In this paper, we have presented the design, implemen- 
tation, and evaluation of NetReview, a fault detection 
system for interdomain routing. NetReview reliably de- 
tects incorrect behavior and links it to the responsible 
AS, while also enabling well-behaved ASes to prove 
they have adhered to the protocol and their routing poli- 
cies. NetReview’s correctness checks can detect and di- 
agnose a wide variety of problems in BGP, including 
faulty equipment, buggy software, policy violations, and 
malicious attacks, which makes it an appealing alterna- 
tive to specific solutions to any one of these problems. 
NetReview does not require changes to the underlying 
routers and is effective even in partial deployments. We 
believe that a fault detection system like NetReview can 
play an important role in improving the reliability, stabil-
ity, and security of interdomain routing.


Abstract


This paper presents ViAggre (Virtual Aggregation), a 
“configuration-only” approach to shrinking the routing 
table on routers. ViAggre does not require any changes 
to router software and routing protocols and can be 
deployed independently and autonomously by any ISP. 
ViAggre is effectively a scalability technique that allows
an ISP to modify its internal routing such that individual 
routers in the ISP’s network only maintain a part of the 
global routing table. 

We evaluate the application of ViAggre to a few tier- 
1 and tier-2 ISPs and show that it can reduce the routing 
table on routers by an order of magnitude while impos- 
ing almost no traffic stretch and negligible load increase 
across the routers. We also deploy Virtual Aggregation 
on a testbed comprising Cisco routers and benchmark
this deployment. Finally, to understand and address 
concerns regarding the configuration overhead that our 
proposal entails, we implement a configuration tool that 
automates ViAggre configuration. While it remains to
be seen whether most, if not all, of the management 
concerns can be eliminated through such automated 
tools, we believe that the simplicity of the proposal 
and its possible short-term impact on routing scalability 
suggest that it is an alternative worth considering. 


I. Introduction 


The Internet default-free zone (DFZ) routing table 
has been growing rapidly for the past few years [20]. 
Looking ahead, there are concerns that as the IPv4
address space runs out, hierarchical aggregation of 
network prefixes will further deteriorate resulting in a 
substantial acceleration in the growth of the routing 
table [31]. A growing IPv6 deployment would worsen 
the situation even more [29]. 

The increase in the size of the DFZ routing ta- 
ble has several harmful implications for inter-domain 
routing. [31] discusses these in detail. At a technical
level, increasing routing table size may drive high- 
end router design into various engineering limits. For 
instance, while memory and processing speeds might 
just scale with a growing routing system, power and heat 
dissipation capabilities may not [30]. On the business 
side, the performance requirements for forwarding while 
being able to access a large routing table imply that the
cost of forwarding packets increases and hence, net-
works become less cost-effective [27]. Further, it makes 
provisioning of networks harder since it is difficult to 
estimate the usable lifetime of routers, not to mention 
the cost of the actual upgrades. As a matter of fact, 
instead of upgrading their routers, a few ISPs have 
resorted to filtering out some small prefixes (mostly 
/24s) which implies that parts of the Internet may not 
have reachability to each other [19]. This suggests that 
ISPs are willing to undergo some pain to avoid the cost 
of router upgrades. 

Such concerns regarding FIB size growth, along with 
problems arising from a large RIB and the concomitant 
convergence issues, were part of the reasons that led 
a recent Internet Architecture Board workshop to con- 
clude that scaling the routing system is one of the most 
critical challenges of near-term Internet design [30]. The 
severity of these problems has also prompted a slew 
of routing proposals [7,8,11,14,18,29,32,40]. All these 
proposals require changes in the routing and addressing 
architecture of the Internet. This, we believe, is the 
nature of the beast since some of the fundamental 
Internet design choices limit routing scalability; the 
overloading of IP addresses with “who” and “where” 
semantics represents a good example [30]. However, 
the very fact that they require architectural change has 
contributed to the non-deployment of these proposals. 

This paper takes the position that a major architec- 
tural change is unlikely and it may be more pragmatic to 
approach the problem through a series of incremental, 
individually cost-effective upgrades. Guided by this and 
the aforementioned implications of a rapidly growing 
DFZ FIB, this paper proposes Virtual Aggregation or 
ViAggre, a scalability technique that focuses primar-
ily on shrinking the FIB size on routers. ViAggre is
a “configuration-only” solution that applies to legacy 
routers. Further, ViAggre can be adopted independently 
and autonomously by any ISP and hence the bar for its 
deployment is much lower. The key idea behind ViAg- 
gre is very simple: an ISP adopting ViAggre divides the 
responsibility for maintaining the global routing table 
amongst its routers such that individual routers only 
maintain a part of the routing table. Thus, this paper 
makes the following contributions: 

• We discuss two deployment options through which
an ISP can adopt ViAggre. The first one uses FIB
suppression to shrink the FIB of all the ISP’s routers
while the second uses route filtering to shrink both
the FIB and RIB on all data-path routers.

• We analyze the application of ViAggre to an actual
tier-1 ISP and several inferred (Rocketfuel [37]) ISP 
topologies. We find that ViAggre can reduce FIB size 
by more than an order of magnitude with negligible 
stretch on the ISP’s traffic and very little increase in 
load across the ISP’s routers. Based on predictions of 
future routing table growth, we estimate that ViAggre 
can be used to extend the life of already outdated 
routers by more than 10 years. 

• We propose utilizing the notion of prefix popularity
to reduce the impact of ViAggre on the ISP’s traffic 
and use a two-month study of a tier-1 ISP’s traffic to 
show the feasibility of such an approach. 

• As a proof-of-concept, we configure test topologies
comprising Cisco routers (on WAIL [3]) according
to the ViAggre proposal. We use the deployment
to benchmark the control-plane processing overhead
that ViAggre entails. One of the presented designs
actually reduces the amount of processing done by 
routers and preliminary results show that it can reduce 
convergence time too. The other design has high 
overhead due to implementation issues and needs 
more experimentation. 

• ViAggre involves the ISP reconfiguring its routers,
which can be a deterrent to adoption. We quantify this
configuration overhead. We also implement a config-
uration tool that, given the ISP’s existing configuration
files, can automatically generate the configuration 
files needed for ViAggre deployment. We discuss the 
use of this tool on our testbed. 


Overall, the incremental version of ViAggre that this 
paper presents can be seen as little more than a simple 
and structured hack that assimilates ideas from existing 
work including, but not limited to, VPN tunnels and 
CRIO [40]. We believe that its very simplicity makes 
ViAggre an attractive short-term solution that provides 
ISPs with an alternative to upgrading routers in order to 
cope with routing table growth till more fundamental, 
long-term architectural changes can be agreed upon and 
deployed in the Internet. However, the basic ViAggre 
idea can also be applied in a clean-slate fashion to 
address routing concerns beyond FIB growth. While 
we defer the design and the implications of such a 
non-incremental ViAggre architecture for future work, 
the notion that the same concept has potential both as 
an immediate alleviative and as the basis for a next- 
generation routing architecture seems interesting and 
worth exploring. 


II. ViAggre design


ViAggre allows individual ISPs in the Internet’s DFZ
to do away with the need for their routers to maintain
routes for all prefixes in the global routing table. An ISP 
adopting ViAggre divides the global address space into 
a set of virtual prefixes such that the virtual prefixes are 
larger than any aggregatable (real) prefix in use today. 
So, for instance, an ISP could divide the IPv4 address 
space into 128 parts with a /7 virtual prefix representing 
each part (0.0.0.0/7 to 254.0.0.0/7). Note that such a 
naive allocation would yield an uneven distribution of 
real prefixes across the virtual prefixes. However, the 
virtual prefixes need not be of the same length and 
hence, the ISP can choose them such that they contain 
a comparable number of real prefixes. 
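
The naive /7 allocation can be reproduced with Python's standard `ipaddress` module. The sketch below is only an illustration: the helper name `virtual_prefix_of` and the handful of sample real prefixes are assumptions, and the helper expects real prefixes that are at least as long as the virtual prefix length.

```python
import ipaddress
from collections import Counter

# Divide the IPv4 address space into 128 equal parts, one /7 per part.
virtual_prefixes = list(ipaddress.ip_network("0.0.0.0/0").subnets(new_prefix=7))
print(len(virtual_prefixes), virtual_prefixes[0], virtual_prefixes[-1])
# 128 0.0.0.0/7 254.0.0.0/7


def virtual_prefix_of(real_prefix, vp_len=7):
    """Return the virtual prefix (of length vp_len) that covers real_prefix."""
    return ipaddress.ip_network(real_prefix).supernet(new_prefix=vp_len)


# Hypothetical sample of real prefixes; actual routing tables are far larger.
real_prefixes = ["4.0.0.0/24", "4.0.0.0/8", "5.1.0.0/16", "192.0.2.0/24"]
load = Counter(str(virtual_prefix_of(p)) for p in real_prefixes)
print(load)
```

Even in this tiny sample the real prefixes are spread unevenly across the virtual prefixes, which is why an ISP may prefer virtual prefixes of different lengths.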

The virtual prefixes are not topologically valid ag-
gregates, i.e. there is not a single point in the Internet
topology that can hierarchically aggregate the encom- 
passed prefixes. ViAggre makes the virtual prefixes 
aggregatable by organizing virtual networks, one for 
each virtual prefix. In other words, a virtual topology is 
configured that causes the virtual prefixes to be aggre- 
gatable, thus allowing for routing hierarchy that shrinks 
the routing table. To create such a virtual network, some 
of the ISP’s routers are assigned to be within the virtual 
network. These routers maintain routes for all prefixes in 
the virtual prefix corresponding to the virtual network 
and hence, are said to be aggregation points for the 
virtual prefix. A router can be an aggregation point 
for multiple virtual prefixes and is required to only 
maintain routes for prefixes in the virtual prefixes it is
aggregating. 

Given this, a packet entering the ISP’s network is 
routed to a close-by aggregation point for the virtual 
prefix encompassing the actual destination prefix. This 
aggregation point has a route for the destination prefix 
and forwards the packet out of the ISP’s network in 
a tunnel. In figure 1 (figure details explained later), 
router C is an aggregation point for the virtual prefix 
encompassing the destination prefix and B → C → D
is one such path through the ISP’s network. 


A. Design Goals 


The discussion above describes ViAggre at a con- 
ceptual level. While the design space for organizing 
an ISP’s network into virtual networks has several 
dimensions, this paper aims for deployability and hence 
is guided by two major design goals: 


1) No changes to router software and routing protocols: 
The ISP should not need to deploy new data-plane 
or control-plane mechanisms. 

2) Transparent to external networks: An ISP’s decision 
to adopt the ViAggre proposal should not impact its 
interaction with its neighbors (customers, peers and 
providers).

These goals, in turn, limit what can be achieved
through the ViAggre designs presented here. Routers 
today have a Routing Information Base (RIB) generated 
by the routing protocols and a Forwarding Information 
Base (FIB) that is used for forwarding the packets. 
Consequently, the FIB is optimized for looking up desti- 
nation addresses and is maintained on fast(er) memory, 
generally on the line cards themselves [31]. All things 
being equal, it would be nice to shrink both the RIB 
and the FIB for all ISP devices, as well as make other 
improvements such as shorter convergence time. 

While the basic ViAggre idea can be used to achieve 
these benefits (section VI), we have not been able to 
reconcile them with the aforementioned design goals. 
Instead, this paper is based on the hypothesis that 
given the performance and monetary implications of 
the FIB size for routers, an immediately deployable 
solution that reduces FIB size is useful. Actually, one of 
the presented designs also shrinks the RIB on routers; 
only components that are off the data path (i.e. route 
reflectors) need to maintain the full RIB. Further, this 
design is shown to help with route convergence time 
too. 


B. Design-I: FIB Suppression


This section details one way an ISP can deploy virtual 
prefix based routing while satisfying the goals specified 
in the previous section. The discussion below applies to 
IPv4 (and BGPv4) although the techniques detailed here 
work equally well for IPv6. The key concept behind 
this design is to operate the ISP’s internal distribution 
of BGP routes untouched and in particular, to populate 
the RIB on routers with the full routing table but to 
suppress most prefixes from being loaded in the FIB 
of routers. A standard feature on routers today is FIB
Suppression, which can be used to prevent routes for
individual prefixes in the RIB from being loaded into 
the FIB. We have verified support for FIB suppression 
as part of our ViAggre deployment on Cisco 7300 
and 12000 routers. Documentation for Juniper [43] and
Foundry [42] routers specifies this feature too. We use
this as described below. 

The ISP does not modify its routing setup — the 
ISP’s routers participate in an intra-domain routing 
protocol that establishes internal routes through which 
the routers can reach each other while BGP is used 
for inter-domain routing just as today. For each virtual 
prefix, the ISP designates some number of routers to 
serve as aggregation points for the prefix and hence,
form a virtual network. Each router is configured to 
only load prefixes belonging to the virtual prefixes it 
is aggregating into its FIB while suppressing all other 
prefixes. 
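
The effect of FIB suppression on a single router can be sketched as a filter from the RIB to the FIB. This is a conceptual model only, not router configuration: the function name `build_fib`, the dictionary representation of the RIB, and the sample routes are assumptions made for illustration.

```python
import ipaddress


def build_fib(rib, aggregated_vps):
    """Load into the FIB only the RIB routes covered by an aggregated virtual prefix.

    rib            -- dict mapping prefix string to next hop (the full table)
    aggregated_vps -- virtual prefixes this router is an aggregation point for
    """
    vps = [ipaddress.ip_network(vp) for vp in aggregated_vps]
    fib = {}
    for prefix, next_hop in rib.items():
        net = ipaddress.ip_network(prefix)
        if any(net.subnet_of(vp) for vp in vps):
            fib[prefix] = next_hop           # loaded into the FIB
        # all other prefixes stay in the RIB but are suppressed from the FIB
    return fib


# Hypothetical RIB of a router that aggregates only 4.0.0.0/7.
rib = {"4.0.0.0/24": "E", "8.8.8.0/24": "F", "5.0.0.0/8": "E"}
print(build_fib(rib, ["4.0.0.0/7"]))
```

The RIB is left untouched in this design, so the saving is confined to the FIB.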

Given this, the ISP needs to ensure that packets to
any prefix can flow through the network in spite of the
fact that only a few routers have a route to the prefix. 
This is achieved as follows: 


— Connecting Virtual Networks. Aggregation points for 
a virtual prefix originate a route to the virtual prefix 
that is distributed throughout the ISP’s network but not 
outside. Specifically, an aggregation point advertises the 
virtual prefix to its iBGP peers. A router that is not an 
aggregation point for the virtual prefix would choose 
the route advertised by the aggregation point closest to 
it and hence, forward packets destined to any prefix in 
the virtual prefix to this aggregation point.
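
The choice of the closest aggregation point can be sketched as a minimum over intra-domain distances. This is a simplification: the full BGP decision process has several earlier steps, and the function name `nearest_aggregation_point`, the router names, and the distances below are hypothetical.

```python
def nearest_aggregation_point(igp_distance, aggregation_points):
    """Pick the aggregation point with the smallest intra-domain distance.

    igp_distance       -- dict mapping router name to IGP cost from this router
    aggregation_points -- routers that advertise the virtual prefix via iBGP
    Ties are broken by router name so that the result is deterministic.
    """
    return min(aggregation_points, key=lambda ap: (igp_distance[ap], ap))


# Hypothetical view from a router that is not an aggregation point itself.
igp_distance = {"C": 10, "F": 25, "G": 10}
print(nearest_aggregation_point(igp_distance, ["C", "F"]))
```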


— Sending packets to external routers. When a router 
receives a packet destined to a prefix in a virtual prefix 
it is aggregating, it can look up its FIB to determine 
the route for the packet. However, such a packet cannot 
be forwarded in the normal hop-by-hop fashion since a 
router that is not an aggregation point for the virtual 
prefix in question might forward the packet back to 
the aggregation point, resulting in a loop. Hence, the 
packet must be tunneled from the aggregation point 
to the external router that was selected as the BGP 
NEXT_HOP. While the ISP can probably choose from 
many tunneling technologies, we use MPLS Label 
Switched Paths (LSPs) for such tunnels. This choice was 
influenced by the fact that MPLS is widely supported in 
routers, is used by ISPs, and operates at wire speed. Fur- 
ther, protocols like LDP [1] automate the establishment 
of MPLS tunnels and hence, reduce the configuration 
overhead. 

However, an LSP from the aggregation point to an
external router would require cooperation from the
neighboring ISP. To avoid this, every edge router of
the ISP initiates an LSP for every external router it is
connected to. Thus, all the ISP routers need to maintain 
LSP mappings equal to the number of external routers 
connected to the ISP, a number much smaller than the 
routes in the DFZ routing table (we relax this constraint 
in section IV-B). Note that even though the tunnel 
endpoint is the external router, the edge router can be 
configured to strip the MPLS label from the data packets 
before forwarding them onto the external router. This, in 
turn, has two implications. First, external routers don’t 
need to be aware of the adoption of ViAggre by the 
ISP. Second, even the edge router does not need a FIB 
entry for the destination prefix; instead, it chooses the
external router to forward the packets to based on the 
MPLS label of the packet. The behavior of the edge 
router here is similar to the penultimate hop in a VPN 
scenario and is achieved through standard configuration. 


Fig. 1. Path of packets destined to prefix 4.0.0.0/24 (or, 4/24) between
external routers A and E through an ISP with ViAggre. Router C is
an aggregation point for virtual prefix 4.0.0.0/7 (or, 4/7).


We now use a concrete example to illustrate the flow of
packets through an ISP network that is using ViAggre.
Figure 1 shows the relevant routers. The ISP is using
/7s as virtual prefixes and router C is an aggregation 
point for one such virtual prefix 4.0.0.0/7. Edge router 
D initiates an LSP to external router E with label l
and hence, the ISP’s routers can get to E through 
MPLS tunneling. The figure shows the path of a packet 
destined to prefix 4.0.0.0/24, which is encompassed by 
4.0.0.0/7, through the ISP’s network. The path from the 
ingress router B to the external router E comprises three 
segments: 


1) VP-routed: Ingress router B is not an aggregation 
point for 4.0.0.0/7 and hence, forwards the packet to 
aggregation point C. 

2) MPLS-LSP: Router C, being an aggregation point 
for 4.0.0.0/7, has a route for 4.0.0.0/24 with BGP 
NEXT_HOP set to E. Further, the path to router E 
involves tunneling the packet with MPLS label l.

3) Map-routed: On receiving the tunneled packet from 
router C, egress router D looks up its MPLS label 
map, strips the MPLS header and forwards the packet 
to external router E. 
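
The three segments can be traced in a small model of the example. This is a sketch under assumptions, not the paper's configuration: the tables below hold only the entries needed for this one packet, the label value "l" is a placeholder, and the function name `trace` is hypothetical.

```python
import ipaddress

VIRTUAL_PREFIX = ipaddress.ip_network("4.0.0.0/7")
AGGREGATION_POINT = "C"                                   # advertises 4.0.0.0/7 inside the ISP
FIB_AT_C = {ipaddress.ip_network("4.0.0.0/24"): "E"}      # prefix -> BGP NEXT_HOP
LSP_LABEL_FOR = {"E": "l"}                                # label of the LSP towards E
LABEL_MAP_AT_D = {"l": "E"}                               # egress router D's MPLS label map


def trace(dst):
    """Return the routers a packet for `dst` visits after entering at B."""
    addr = ipaddress.ip_address(dst)
    hops = ["B"]

    # 1) VP-routed: B has no route for the real prefix, only for the virtual one.
    if addr not in VIRTUAL_PREFIX:
        raise LookupError("destination is outside the modelled virtual prefix")
    hops.append(AGGREGATION_POINT)

    # 2) MPLS-LSP: C does a longest-prefix match and tunnels towards the next hop.
    matches = [net for net in FIB_AT_C if addr in net]
    if not matches:
        raise LookupError("aggregation point has no route for the destination")
    best = max(matches, key=lambda net: net.prefixlen)
    label = LSP_LABEL_FOR[FIB_AT_C[best]]
    hops.append("D")

    # 3) Map-routed: D strips the label and forwards based on its label map alone.
    hops.append(LABEL_MAP_AT_D[label])
    return hops


print(trace("4.0.0.1"))
```

Note that egress router D never consults a FIB entry for 4.0.0.0/24; the label map is sufficient to pick the external router.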


C. Design-II: Route Reflectors 


The second design offloads the task of maintaining 
the full RIB to devices that are off the data path. 
Many ISPs use route-reflectors for scalable internal 
distribution of BGP prefixes and we require only these 
route-reflectors to maintain the full RIB. For ease of 
exposition, we assume that the ISP is already using per- 
PoP route reflectors that are off the data path, a common 
deployment model for ISPs using route reflectors. 

In the proposed design, the external routers connected 
to a PoP are made to peer with the PoP’s route- 
reflector. This is necessary since the external peer may 
be advertising the entire DFZ routing table and we don’t 
want all these routes to reside on any given data-plane 
router. The route-reflector also has iBGP peerings with 
other route-reflectors and with the routers in its PoP. 
Egress filters are used on the route-reflector’s peerings 
with the PoP’s routers to ensure that a router only gets 
routes for the prefixes it is aggregating. This shrinks 
both the RIB and the FIB on the routers. The data- 
plane operation and hence, the path of packets through 
the ISP’s network remains the same as with the previous 
design. 

With this design, a PoP’s route-reflector peers with 
all the external routers connected to the PoP. The RIB 
size on a BGP router depends on the number of peers it 
has and hence, the RIB for the route-reflectors can po- 
tentially be very large. If needed, the RIB requirements 
can be scaled by using multiple route-reflectors which 
may also be needed to provide customised routes to the 
PoP’s neighbors. Note that the RIB scaling properties 
here are better than in the status quo. Today, edge 
routers have no choice but to peer with the directly 
connected external routers and maintain the resulting 
RIB. Replicating these routers is prohibitive because of 
their cost but the same does not apply to off-path route- 
reflectors, which could even be BGP software routers. 


D. Design Comparison 


As far as the configuration is concerned, configuring 
suppression of routes on individual routers in design-I is 
comparable, at least in terms of complexity, to configur- 
ing egress filters on the route-reflectors. In both cases, 
the configuration can be achieved through BGP route- 
filtering mechanisms (access-lists, prefix-lists, etc.). 

Design-II, apart from shrinking the RIB on the 
routers, does not require the route suppression feature 
on routers. Further, as we detail in section V-B, design- 
II reduces the ISP’s route propagation time while the 
specific filtering mechanism used in design-I increases 
it. However, design-II does require the ISP’s eBGP peer- 
ings to be reconfigured which, while straightforward, 
violates our goal of not impacting neighboring ISPs. 


E. Network Robustness 


ViAggre causes packets to be routed through an 
aggregation point which leads to robustness concerns. 
When an aggregation point for a virtual prefix fails, 
routers using that aggregation point are re-routed to 
another aggregation point through existing mechanisms 
without any explicit configuration by the ISP. In case of 
design-I, a router has routes to all aggregation points for 
a given virtual prefix in its RIB and hence, when the 
aggregation point being used fails, the router installs 
the second closest aggregation point into its FIB and 
packets are re-routed almost instantly. With design- 
II, it is the route-reflector that chooses the alternate 
aggregation point and advertises this to the routers in its 
PoP. Hence, as long as another aggregation point exists, 
failover happens automatically and at a fast rate. 


F. Routing popular prefixes natively 


The use of aggregation points implies that packets 
in ViAggre may take paths that are longer than native 
paths. Apart from the increased path length, the packets 
may incur queuing delay at the extra hops. The extra 
hops also result in an increase in load on the ISP’s 
routers and links and a modification in the distribution 
of traffic across them. 

Past studies have shown that a large majority of 
Internet traffic is destined to a very small fraction of 
prefixes [10,13,34,38]. The fact that routers today have 
no choice but to maintain the complete DFZ routing 
table implies that this observation wasn’t very useful for 
routing configuration. However, with ViAggre, individ- 
ual routers only need to maintain routes for a fraction of 
prefixes. The ISP can thus configure its ViAggre setup 
such that the small fraction of popular prefixes are in 
the FIB of every router and hence, are routed natively. 
For design-I, this involves configuring each router with 
a set of popular prefixes that should not be suppressed 
from the FIB. For design-II, a PoP’s route-reflector can 
be configured to not filter advertisements for popular 
prefixes from the PoP’s routers. Beyond this, the ISP 
may also choose to install customer prefixes into its 
routers such that they don’t incur any stretch. The rest of 
the proposal involving virtual prefixes remains the same 
and ensures that individual routers only maintain routes 
for a fraction of the unpopular prefixes. In section IV- 
B.4, we analyze Netflow data from a tier-1 ISP network 
to show that not only is such an approach feasible, it 
also addresses all the concerns raised above. 


III. Allocating aggregation points 


An ISP adopting ViAggre would obviously like to 
minimise the stretch imposed on its traffic. Ideally, an 
ISP would deploy an aggregation point for all virtual 
prefixes in each of its PoPs. This would ensure that for 
every virtual prefix, a router chooses the aggregation 
point in the same PoP and hence, the traffic stretch is 
minimal. However, this may not be possible in practice. 
This is because ISPs, including tier-1 ISPs, often have 
some small PoPs with just a few routers and therefore 
there may not be enough cumulative FIB space in the 
PoP to hold all the actual prefixes. More generally, 
ISPs may be willing to bear some stretch for substantial 
reductions in FIB size. To achieve this, the ISP needs 
to be smart about the way it designates routers to 
aggregate virtual prefixes and in this section we explore 
this choice. 


A. Problem Formulation 


We first introduce the notation used in the rest of this 
section. Let $T$ represent the set of prefixes in the Internet 
routing table, $R$ be the set of the ISP’s routers and $X$ be the 
set of external routers directly connected to the ISP. For 
each $r \in R$, $P_r$ represents the set of popular prefixes for 
router $r$. $V$ is the set of virtual prefixes chosen by the ISP 
and for each $v \in V$, $n_v$ is the number of prefixes in $v$. 
We use two matrices, $D = (d_{i,j})$ that gives the distance 
between routers $i$ and $j$ and $W = (w_{i,j})$ that gives the 
IGP metric for the IGP-established path between routers 
$i$ and $j$. We also define two relations: 
– “BelongsTo” relation $B: T \to V$ such that $B(p) = v$ if 
prefix $p$ belongs to or is encompassed by virtual prefix 
$v$. 
– “Egress” relation $E: R \times T \to R$ such that $E(i, p) = j$ if 
traffic to prefix $p$ from router $i$ egresses at router $j$. 

The mapping relation $A: R \to 2^V$ captures how 
the ISP assigns aggregation points; i.e. $A(r) = 
\{v_1, \ldots, v_n\}$ implies that router $r$ aggregates virtual 
prefixes $\{v_1, \ldots, v_n\}$. Given this assignment, we can 
determine the aggregation point any router uses for its 
traffic to each virtual prefix. This is captured by the 
“Use” relation $U: R \times V \to R$ where $U(i, v) = j$, or 
router $i$ uses aggregation point $j$ for virtual prefix $v$, if 
the following conditions are satisfied: 


1) $v \in A(j)$ 
2) $w_{i,j} \le w_{i,k} \quad \forall k \in R,\ v \in A(k)$ 


Here, condition 1) ensures that router $j$ is an aggregation 
point for virtual prefix $v$. Condition 2) captures the 
operation of BGP with design-I and ensures that a router 
chooses the aggregation point that is closest in terms of 
IGP metrics.° 
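
The two conditions defining the “Use” relation amount to picking, among the routers that aggregate a virtual prefix, the one with the smallest IGP metric from the source router. A minimal sketch follows, assuming the assignment A and the weight matrix W are stored as plain dictionaries; the router names, weights and the tie-breaking rule (first candidate found) are illustrative assumptions.

```python
def use(i, v, A, W):
    """Return the aggregation point router i uses for virtual prefix v.

    A maps each router to the set of virtual prefixes it aggregates and
    W[i][j] is the IGP metric from i to j. Returns None if nobody aggregates v.
    """
    candidates = [j for j in A if v in A[j]]           # condition 1
    if not candidates:
        return None
    return min(candidates, key=lambda j: W[i][j])      # condition 2


# Illustrative toy assignment and IGP metrics.
A = {"r1": {"vp1"}, "r2": {"vp2"}, "r3": {"vp1", "vp2"}}
W = {
    "r1": {"r1": 0, "r2": 5, "r3": 9},
    "r2": {"r1": 5, "r2": 0, "r3": 2},
    "r3": {"r1": 9, "r2": 2, "r3": 0},
}

choice_r1_vp2 = use("r1", "vp2", A, W)
choice_r2_vp1 = use("r2", "vp1", A, W)
```

Here r1 reaches vp2 through r2 (metric 5 rather than 9), while r2 reaches vp1 through r3 (metric 2 rather than 5).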

Using this notation, we can express the FIB size on 
routers and the stretch imposed on traffic. 

1) Routing State. In ViAggre, a router needs to 
maintain routes to the (real) prefixes in the virtual pre- 
fixes it is aggregating, routes to all the virtual prefixes 
themselves and routes to the popular prefixes. Further, 
the router needs to maintain LSP mappings for LSPs 
originated by the ISP’s edge routers with one entry for 
each external router connected to the ISP. Hence, the 
“routing state” for the router $r$, hereon simply referred 
to as the FIB size ($F_r$), is given by: 


$$F_r = \sum_{v \in A(r)} n_v + |V| + |P_r| + |X|$$ 


The Worst FIB size and the Average FIB size are 
defined as follows: 


$$\text{Worst FIB size} = \max_{r \in R} (F_r)$$ 
$$\text{Average FIB size} = \sum_{r \in R} (F_r) / |R|$$ 
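
As a worked instance of the FIB size definitions, the sketch below evaluates $F_r$ and the two summary metrics for a toy ISP. All values (two virtual prefixes, three external routers, two routers) are illustrative assumptions chosen only to make the arithmetic easy to follow.

```python
def fib_size(r, A, n, V, P, X):
    """F_r: real prefixes aggregated by r, plus |V|, |P_r| and |X|."""
    return sum(n[v] for v in A[r]) + len(V) + len(P[r]) + len(X)


# Illustrative toy values only.
V = ["vp1", "vp2"]
n = {"vp1": 100, "vp2": 40}           # real prefixes inside each virtual prefix
X = ["x1", "x2", "x3"]                # external routers (one LSP mapping each)
A = {"r1": {"vp1"}, "r2": {"vp2"}}    # aggregation point assignment
P = {"r1": {"10.1.0.0/16", "10.2.0.0/16"}, "r2": set()}

sizes = {r: fib_size(r, A, n, V, P, X) for r in A}
worst_fib_size = max(sizes.values())
average_fib_size = sum(sizes.values()) / len(sizes)
```

Router r1 ends up with 100 + 2 + 2 + 3 = 107 entries and r2 with 40 + 2 + 0 + 3 = 45, so the worst FIB size is 107 and the average is 76.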


2) Traffic Stretch: If router $i$ uses router $k$ as 
an aggregation point for virtual prefix $v$, packets from 
router $i$ to a prefix $p$ belonging to $v$ are routed through 
router $k$. Hence, the stretch ($S$) imposed on traffic to 
prefix $p$ from router $i$ is given by: 


$$S_{i,p} = (d_{i,k} + d_{k,j} - d_{i,j}), \quad p \in T,\ v = B(p),\ k = U(i,v) \ \&\ j = E(k,p)$$ 


The Worst Stretch and Average Stretch are defined 
as follows: 


$$\text{Worst Stretch} = \max_{i \in R,\, p \in T} (S_{i,p})$$ 
$$\text{Average Stretch} = \sum_{i \in R,\, p \in T} (S_{i,p}) / (|R| * |T|)$$ 
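
The stretch of a single (router, prefix) pair can be computed directly from the relations B, U and E and the distance matrix D. The sketch below assumes these are stored as dictionaries; the three routers and their distances are illustrative values, not measurements.

```python
def stretch(i, p, B, U, E, D):
    """S_{i,p}: detour via aggregation point k minus the direct distance to egress j."""
    v = B[p]            # virtual prefix that encompasses p
    k = U[(i, v)]       # aggregation point that i uses for v
    j = E[(k, p)]       # egress router for p as seen from k
    return D[i][k] + D[k][j] - D[i][j]


# Illustrative symmetric distances (in ms) between routers a, b and c.
D = {
    "a": {"a": 0, "b": 3, "c": 5},
    "b": {"a": 3, "b": 0, "c": 4},
    "c": {"a": 5, "b": 4, "c": 0},
}
B = {"p": "v"}
U = {("a", "v"): "b", ("b", "v"): "b"}
E = {("b", "p"): "c"}

stretch_from_a = stretch("a", "p", B, U, E, D)   # 3 + 4 - 5
stretch_from_b = stretch("b", "p", B, U, E, D)   # b is its own aggregation point
```

A router that is itself the aggregation point incurs no stretch, as the second call shows.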


Problem: ViAggre, through the use of aggregation 
points, trades off an increase in path length for a reduc- 
tion in routing state. The ISP can use the assignment 
of aggregation points as a knob to tune this trade-off. 
Here we consider the simple goal of minimising the FIB 
Size on the ISP’s routers while bounding the stretch. 
Specifically, the ISP needs to assign aggregation points 
by determining a mapping A that 


$$\min \ \text{Worst FIB Size} \quad \text{s.t.} \quad \text{Worst Stretch} \le C$$ 


where C is the specified constraint on Worst Stretch. 
Note that much more complex formulations are pos- 
sible. Our focus on worst-case metrics is guided by 
practical concerns — the Worst FIB Size dictates how 
the ISP’s routers need to be provisioned while the Worst 
Stretch characterizes the most unfavorable impact of 
the use of ViAggre. Specifically, bounding the Worst 
Stretch allows the ISP to ensure that its existing SLAs 
are not breached and applications sensitive to increase 
in latency (for example, VoIP) are not adversely affected. 


B. A Greedy Solution 


The problem of assigning aggregation points while 
satisfying the conditions above can be mapped to 
the MultiCommodity Facility Location (MCL) prob- 
lem [33]. MCL is NP-hard and [33] presents a logarith- 
mic approximation algorithm for it. Here we discuss a 
greedy approximation solution to the problem, similar 
to the algorithm in [25]. 

The first step is to determine which routers router $i$ 
can serve without violating the stretch constraint if it 
were to aggregate virtual prefix $v$. This is 
the $can\_serve_{i,v}$ set and is defined as follows: 


$$can\_serve_{i,v} = \{\, j \mid j \in R,\ (\forall p \in T,\ B(p) = v,\ E(i,p) = k,\ (d_{j,i} + d_{i,k} - d_{j,k}) \le C) \,\}$$ 
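
The can_serve set can be computed by brute force over routers and prefixes. The following sketch assumes dictionary-based inputs and a non-strict bound (stretch of at most C); the toy topology is illustrative only.

```python
def can_serve(i, v, R, T, B, E, D, C):
    """Routers j that aggregation point i could serve for v within stretch C."""
    prefixes = [p for p in T if B[p] == v]
    served = set()
    for j in R:
        # Every prefix in v must stay within the stretch bound for j.
        if all(D[j][i] + D[i][E[(i, p)]] - D[j][E[(i, p)]] <= C for p in prefixes):
            served.add(j)
    return served


# Illustrative toy topology: distances in ms, one prefix p inside v.
R = ["a", "b", "c"]
T = ["p"]
B = {"p": "v"}
E = {("b", "p"): "c"}      # from b, traffic to p egresses at c
D = {
    "a": {"a": 0, "b": 3, "c": 5},
    "b": {"a": 3, "b": 0, "c": 4},
    "c": {"a": 5, "b": 4, "c": 0},
}

served_c2 = can_serve("b", "v", R, T, B, E, D, C=2)
served_c1 = can_serve("b", "v", R, T, B, E, D, C=1)
```

With C = 2 ms, router b can serve a (stretch 2) and itself (stretch 0) but not c, which would have to detour 8 ms; tightening C to 1 ms leaves only b.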


Given this, the key idea behind the solution is that 
any assignment based on the can_serve relation will 
keep the Worst Stretch within the bound C. Hence, our algorithm 
designates routers to aggregate virtual prefixes in ac- 
cordance with the can_serve relation while greedily 
trying to minimise the Worst FIB Size. The algorithm, 
shown below, stops when each router can be served 
by at least one aggregation point for each virtual prefix. 


Worst_FIB_Size = 0 
for all r in R do 
    for all v in V do 
        Calculate can_serve_{r,v} 
Sort V in decreasing order of n_v 
for all v in V do 
    Sort R in decreasing order of |can_serve_{r,v}| 
    repeat 
        for all r in R do 
            if (F_r + n_v) ≤ Worst_FIB_Size then 
                A[r] = A[r] ∪ v          # Assign v to r 
                F_r = F_r + n_v          # r's FIB size increases 
                Mark all routers in can_serve_{r,v} as served 
            if All routers are served for v then 
                break 
        if All routers are not served for v then 
            # Worst_FIB_Size needs to be raised 
            for all r in R do 
                if v ∉ A[r] then 
                    # r is not an aggregation point for v 
                    A[r] = A[r] ∪ v 
                    F_r = F_r + n_v 
                    Worst_FIB_Size = F_r 
                    break 
    until All Routers are served for virtual prefix v 
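
The greedy procedure can be sketched in runnable form as follows. This is an illustrative reading of the pseudocode rather than the authors' implementation, and it departs from the listing in three labeled ways: a router is only assigned a virtual prefix if that serves at least one router not yet served, the budget-raising step also marks the newly served routers, and an error is raised if no assignment can serve everyone. As in the pseudocode, F counts only aggregated real prefixes. The example inputs are hypothetical.

```python
def greedy_assign(routers, vprefixes, n, can_serve):
    """Greedy aggregation point assignment; returns (A, F, worst_fib_size).

    can_serve[(r, v)] is the set of routers that r could serve for v
    within the stretch bound.
    """
    everyone = set(routers)
    A = {r: set() for r in routers}
    F = {r: 0 for r in routers}
    worst = 0
    for v in sorted(vprefixes, key=lambda x: n[x], reverse=True):
        order = sorted(routers, key=lambda r: len(can_serve[(r, v)]), reverse=True)
        served = set()
        while not everyone <= served:
            # Pass 1: use routers that fit under the current worst FIB size.
            for r in order:
                helps = can_serve[(r, v)] - served       # departure: skip useless picks
                if v not in A[r] and helps and F[r] + n[v] <= worst:
                    A[r].add(v)
                    F[r] += n[v]
                    served |= can_serve[(r, v)]
            if everyone <= served:
                break
            # Pass 2: nobody else fits, so raise the worst FIB size once.
            for r in order:
                if v not in A[r] and can_serve[(r, v)] - served:
                    A[r].add(v)
                    F[r] += n[v]
                    served |= can_serve[(r, v)]          # departure: mark as served
                    worst = max(worst, F[r])
                    break
            else:
                raise ValueError(f"remaining routers cannot be served for {v}")
    return A, F, worst


# Hypothetical inputs: three routers, two virtual prefixes.
routers = ["a", "b", "c"]
vprefixes = ["v1", "v2"]
n = {"v1": 10, "v2": 4}
can_serve = {
    ("a", "v1"): {"a", "b"}, ("b", "v1"): {"b"}, ("c", "v1"): {"c"},
    ("a", "v2"): {"a", "b", "c"}, ("b", "v2"): {"b"}, ("c", "v2"): {"c"},
}

A, F, worst = greedy_assign(routers, vprefixes, n, can_serve)
```

In this example router a is the only router that can serve itself for either virtual prefix, so it must aggregate both and the worst FIB size cannot drop below 14.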


IV. Evaluation 


In this section we evaluate the application of ViAggre 
to a few Internet ISPs. 


A. Metrics of Interest 


We defined (Average and Worst) FIB Size and 
Stretch metrics in section III-A. Here we define other 
metrics that we use for ViAggre evaluation. 

1) Impact on Traffic: Apart from the stretch im- 
posed, another aspect of ViAggre’s impact is the amount 
of traffic affected. To account for this, we define traffic 
impacted as the fraction of the ISP’s traffic that uses 
a different router-level path than the native path. Note 
that in many cases, a router will use an aggregation 
point for the destination virtual prefix in the same PoP 
and hence, the packets will follow the same PoP-level 
path as before. Thus, another metric of interest is the 
traffic stretched, the fraction of traffic that is forwarded 
along a different PoP-level path than before. In effect, 
this represents the change in the distribution of traffic 
across the ISP’s inter-PoP links and hence, captures 
how ViAggre interferes with the ISP’s inter-PoP traffic 
engineering. 

2) Impact on Router Load: The extra hops traversed 
by traffic increase the traffic load on the ISP’s routers. 
We define the load increase across a router as the extra 
traffic it needs to forward due to ViAggre, as a fraction 
of the traffic it forwards natively. 
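
The traffic impacted and traffic stretched metrics differ only in the granularity at which paths are compared. The sketch below assumes each flow record carries its native and ViAggre router-level paths; the record layout, router names and byte counts are hypothetical.

```python
def pop_level(path, pop_of):
    """Collapse a router-level path into its PoP-level path."""
    pops = []
    for router in path:
        pop = pop_of[router]
        if not pops or pops[-1] != pop:
            pops.append(pop)
    return pops


def traffic_metrics(flows, pop_of):
    """Return (traffic impacted, traffic stretched) as fractions of total bytes."""
    total = sum(f["bytes"] for f in flows)
    impacted = sum(f["bytes"] for f in flows if f["native"] != f["viaggre"])
    stretched = sum(
        f["bytes"] for f in flows
        if pop_level(f["native"], pop_of) != pop_level(f["viaggre"], pop_of)
    )
    return impacted / total, stretched / total


# Hypothetical routers: B and C share a PoP, D and F sit in other PoPs.
pop_of = {"B": "pop1", "C": "pop1", "D": "pop2", "F": "pop3"}
flows = [
    {"bytes": 60, "native": ["B", "D"], "viaggre": ["B", "C", "D"]},  # same PoP path
    {"bytes": 30, "native": ["B", "D"], "viaggre": ["B", "F", "D"]},  # extra PoP
    {"bytes": 10, "native": ["B", "D"], "viaggre": ["B", "D"]},       # unchanged
]

impacted, stretched = traffic_metrics(flows, pop_of)
```

Here 90% of the bytes take a different router-level path, but only 30% leave the native PoP-level path.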


B. Tier-1 ISP Study 


We analysed the application of ViAggre to a large 
tier-1 ISP in the Internet. For our study, we obtained 
the ISP’s router-level topology (to determine router set 
R) and the routing tables of routers (to determine prefix 
set $T$ and the Egress $E$ and BelongsTo $B$ relations). We 
used information about the geographical locations of 
the routers to determine the Distance matrix $D$ such 
that $d_{i,j}$ is 0 if routers $i$ and $j$ belong to the same 
PoP (and hence, are in the same city) else $d_{i,j}$ is set 
to the propagation latency corresponding to the great 
circle distance between $i$ and $j$. Further, we did not 
have information about the ISP’s link weights. However, 
guided by the fact that intra-domain traffic engineering 
is typically latency-driven [36], we use the Distance 
matrix D as the Weight matrix W. We also obtained the 
ISP’s traffic matrix; however, in order to characterise the 
impact of vanilla ViAggre, the first part of this section 
assumes that the ISP does not consider any prefixes as 
popular. 
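
One way to build such a Distance matrix is the haversine formula followed by a conversion from kilometres to propagation delay. The sketch below is an illustration under stated assumptions: the Earth radius of 6371 km and the signal speed of 200 km per ms (roughly two thirds of the speed of light, a common figure for fiber) are assumed constants, and the router record layout is hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0          # assumed mean Earth radius
SIGNAL_SPEED_KM_PER_MS = 200.0    # assumed propagation speed in fiber


def great_circle_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points given in degrees (haversine)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def distance_ms(router_a, router_b):
    """d_{i,j}: 0 within a PoP, else one-way propagation latency in ms."""
    if router_a["pop"] == router_b["pop"]:
        return 0.0
    km = great_circle_km(router_a["lat"], router_a["lon"],
                         router_b["lat"], router_b["lon"])
    return km / SIGNAL_SPEED_KM_PER_MS


# Hypothetical routers placed on the equator, a quarter of the globe apart.
r1 = {"pop": "pop1", "lat": 0.0, "lon": 0.0}
r2 = {"pop": "pop1", "lat": 0.0, "lon": 0.0}
r3 = {"pop": "pop2", "lat": 0.0, "lon": 90.0}

same_pop = distance_ms(r1, r2)
cross_pop = distance_ms(r1, r3)
```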


1) Deployment decisions: The ISP, in order to adopt 
ViAggre, needs to decide what virtual prefixes to use 
and which routers aggregate these virtual prefixes. We 
describe the approaches we evaluated. 
— Determining set V. The most straightforward way to 
select virtual prefixes while satisfying the two condi- 
tions specified in section II is to choose large prefixes 
(/6s, /7s, etc.) as virtual prefixes. We assume that the 
ISP uses /7s as its virtual prefixes and refer to this as 
the “/7 allocation”. 

However, such selection of virtual prefixes could lead 
to a skewed distribution of (real) prefixes across them 
with some virtual prefixes containing a large number of 
prefixes. For instance, using /7s as virtual prefixes im- 
plies that the largest virtual prefix (202.0.0.0/7) contains 
22,772 of the prefixes in today’s BGP routing table or 
8.9% of the routing table. Since at least one ISP router 
needs to aggregate each virtual prefix, such large virtual 
prefixes would inhibit the ISP’s ability to reduce the 
Worst FIB size on its routers. However, as we mentioned 
earlier, the virtual prefixes need not be of the same 
length and so large virtual prefixes can be split to yield 
smaller virtual prefixes. To study the effectiveness of 
this approach, we started with /7s as virtual prefixes and 
split each of them such that the resulting virtual prefixes 
were still larger than any prefix in the Internet routing 
table. This yielded 1024 virtual prefixes with the largest 
containing 4,551 prefixes or 1.78% of the BGP routing 
table. We also use this virtual prefix allocation for our 
evaluation and refer to it as “Uniform Allocation”. 


Fig. 2. FIB composition for the router with the largest FIB, C=4ms 
and no popular prefixes. 


— Determining mapping A. We implemented the algo- 
rithm described in section III-B and use it to designate 
routers to aggregate virtual prefixes. 
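
The splitting of large virtual prefixes can be sketched as a recursive halving that stops as soon as a half would no longer be larger than every real prefix it contains. The stopping threshold max_real is a hypothetical parameter introduced here for illustration; the text above does not state what target the split used, and the prefixes below are made up.

```python
import ipaddress


def split_virtual_prefix(vp, real_prefixes, max_real):
    """Split vp into halves while it holds more than max_real real prefixes.

    A split is refused if a half would be as long as (or longer than) some
    real prefix inside vp, so every virtual prefix stays larger than the
    real prefixes it encompasses.
    """
    inside = [p for p in real_prefixes if p.subnet_of(vp)]
    if len(inside) <= max_real:
        return [vp]
    child_len = vp.prefixlen + 1
    if any(p.prefixlen <= child_len for p in inside):
        return [vp]
    result = []
    for child in vp.subnets(prefixlen_diff=1):
        result.extend(split_virtual_prefix(child, inside, max_real))
    return result


# Made-up real prefixes inside the virtual prefix 4.0.0.0/7.
real = [ipaddress.ip_network(p) for p in
        ["4.0.0.0/24", "4.1.0.0/16", "5.0.0.0/24", "5.2.0.0/16"]]
parts = split_virtual_prefix(ipaddress.ip_network("4.0.0.0/7"), real, max_real=2)
```

With these inputs the /7 is split once into 4.0.0.0/8 and 5.0.0.0/8, each holding two real prefixes.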

2) Router FIB: We first look at the size and the 
composition of the FIB on the ISP’s routers with a 
ViAggre deployment. Specifically, we focus on the 
router with the largest FIB for a deployment where 
the worst-case stretch (C) is constrained to 4ms. The 
first two bars in figure 2 show the FIB composition 
for a /7 and uniform allocation respectively. With a 
/7 allocation, the router’s FIB contains 46,543 entries 
which represents 18.2% of the routing table today. This 
includes 22,772 prefixes, 128 virtual prefixes, 23,643 
LSP mappings and 0 popular prefixes. As can be seen, 
in both cases, the LSP mappings for tunnels to the 
external routers contribute significantly to the FIB. This 
is because the ISP has a large number of customer 
routers that it has peerings with. 

However, we also note that customer ISPs do not 
advertise the full routing table to their provider. Hence, 
edge routers of the ISP could maintain routes advertised 
by customer routers in their FIB, advertise these routes 
onwards with themselves as the BGP NEXT_HOP and 
only initiate LSP advertisements for themselves and 
for peer and provider routers connected to them. With 
such a scheme, the number of LSP mappings that the 
ISP’s routers need to maintain and the MPLS overhead 
in general reduces significantly. The latter set of bars 
in figure 2 shows the FIB composition with such a 
deployment for the router with the largest FIB. For 
the /7 allocation, the Worst FIB size is 23,101 entries 
(9.02% of today’s routing table) while for the Uniform 
allocation, it is 10,226 entries (4.47%). In the rest of 
this section, we assume this model of deployment. 

3) Stretch Vs. FIB Size: We ran the assignment 
algorithm with Worst Stretch Constraint (C) ranging 
from 0 to 10 ms and determined the (Average and 
Worst) Stretch and FIB Size of the resulting ViAggre 
deployment. Figure 3(a) plots these metrics for the /7 
allocation. The Worst FIB size, shown as a fraction of 
the DFZ routing table size today, expectedly reduces as 
the constraint on Worst Stretch is relaxed. However, be- 
yond C=4ms, the Worst FIB Size remains constant. This 
is because the largest virtual prefix with a /7 allocation 
encompasses 8.9% of the DFZ routing table and the 
Worst FIB Size cannot be any less than 9.02% (0.12% 
overhead is due to virtual prefixes and LSP mappings). 
Figure 3(b) plots the same metrics for the Uniform allo- 
cation and shows that the FIB can be shrunk even more. 
The figure also shows that the Average FIB Size and the 
Average stretch are expectedly small throughout. The 
anomaly beyond C=8msec in figure 3(b) results from the 
fact that our assignment algorithm is an approximation 
that can yield non-optimal results. 

Another way to quantify the benefits of ViAggre is 
to determine the extension in the life of a router with 
a specified memory due to the use of ViAggre. As 
proposed in [21], we used data for the DFZ routing 
table size from Jan’02 to Dec’07 [20] to fit a quadratic 
model to routing table growth. Further, it has been 
claimed that the DFZ routing table has seen exponential 
growth at the rate of 1.3x every two years for the past 
few years and will continue to do so [30]. We use 
these models to extrapolate future DFZ routing table 
size. We consider two router families. The first is Cisco’s 
Cat6500 series with a supervisor 720-3B forwarding engine, 
which can hold up to 239K IPv4 FIB entries and hence was 
supposed to be phased out by mid-2007 [6], though 
some ISPs still continue to use it. The second is 
Cisco’s current generation of routers with a supervisor 
720-3BXL engine that can hold 1M IPv4 FIB entries. 


Fig. 4. Variation of the percentage of traffic stretched/impacted and 
load increase across routers with Worst Stretch Constraint (Uniform 
Allocation) and no popular prefixes. 


For each of these router families, we calculate the year 
to which they would be able to cope with the growth 
in the DFZ routing table with the existing setup and 
with ViAggre. Table I shows the results for the Uniform 
Allocation. 


For ViAggre, relaxing the worst-case stretch con- 
straints reduces FIB size and hence, extends the router 
life. The table shows that if the DFZ routing table were 
to grow at the aforementioned exponential rate, ViAggre 
can extend the life of the previous generation of routers 
to 2018 with no stretch at all. We realise that estimates 
beyond a few years are not very relevant since the ISP 
would need to upgrade its routers for other reasons such 
as newer technologies and higher data rates anyway. 
However, with ViAggre, at least the ISP is not forced 
to upgrade due to growth in the routing table. 
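
The router-life estimate under the exponential model reduces to solving for the time at which the required FIB size reaches the router's capacity. The sketch below assumes growth by a factor of 1.3 every two years, as quoted above, and additionally assumes that the fraction of the table a ViAggre router must hold stays constant as the table grows. The base table size and year in the example are illustrative inputs, so the output is not meant to reproduce Table I.

```python
import math


def year_capacity_reached(capacity, base_size, base_year,
                          fib_fraction=1.0, growth=1.3, period_years=2.0):
    """Year in which fib_fraction of the routing table reaches capacity.

    The table is assumed to grow by `growth` every `period_years`.
    Returns base_year if the capacity is already exceeded.
    """
    needed_now = base_size * fib_fraction
    if needed_now >= capacity:
        return float(base_year)
    years = period_years * math.log(capacity / needed_now) / math.log(growth)
    return base_year + years


# Illustrative inputs: a 100K-entry table in 2008 and a 130K-entry router.
status_quo = year_capacity_reached(130_000, 100_000, 2008)
with_viaggre = year_capacity_reached(130_000, 100_000, 2008, fib_fraction=0.5)
```

Holding the full table, the example router lasts one growth period (two years); holding half of it extends that by several more years.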


Figure 4 plots the impact of ViAggre on the 
ISP’s traffic and router load. The percentage of traffic 
stretched is small, less than 1% for C < 6 ms. This 
shows that almost all the traffic is routed through an ag- 
gregation point in the same PoP as the ingress. However, 
the fact that no prefixes are considered popular implies 
that almost all the traffic follows a different router-level 
path as compared to the status quo. This shows up in 
figure 4 since the traffic impacted is ~100% throughout. 
This, in turn, results in a median increase in load across 
the routers by ~39%. In the next section we discuss 
how an ISP can use the skewed distribution of traffic 
to address the load concern while maintaining a small 
FIB on its routers. 


4) Popular Prefixes: Past studies of ISP traffic pat- 
terns from as early as 1999 have observed that a small 
fraction of Internet prefixes carry a large majority of ISP 
traffic [10,13,34,38]. We used Netflow records collected 
across the routers of the same tier-1 ISP as in the 
last section for a period of two months (20th Nov’07 
to 20th Jan’08) to generate per-prefix traffic statistics 
and observed that this pattern continues to the present 
day. The line labeled “Day-based, ISP-wide” in figure 5 
plots the average fraction of the ISP’s traffic destined 
to a given fraction of popular prefixes when the set 
of popular prefixes is calculated across the ISP on a 
daily basis. The figure shows that 1.5% of most popular 
prefixes carry 75.5% of the traffic while 5% of the 
prefixes carry 90.2% of the traffic. 


Fig. 5. Popular prefixes carry a large fraction of the ISP’s traffic. 


ViAggre exploits the notion of prefix popularity to 
reduce its impact on the ISP’s traffic. However, the ISP’s 
routers need not consider the same set of prefixes as 
popular; instead the popular prefixes can be chosen per- 
PoP or even per-router. We calculated the fraction of 
traffic carried by popular prefixes, when popularity is 
calculated separately for each PoP on a daily basis. This 
is plotted in the figure as “Day-based, per-PoP” and the 
fractions are even higher. 
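
The curves in figure 5 are cumulative traffic shares of the most popular prefixes. A minimal sketch of that computation follows, assuming per-prefix byte counts have already been derived from flow records; the counts below are made up for illustration and are not the ISP's data.

```python
import math


def traffic_share_of_top(prefix_bytes, fraction):
    """Fraction of total traffic carried by the top `fraction` of prefixes."""
    ranked = sorted(prefix_bytes.values(), reverse=True)
    top_n = max(1, math.ceil(len(ranked) * fraction))
    return sum(ranked[:top_n]) / sum(ranked)


# Made-up per-prefix byte counts for ten prefixes (total 1000).
prefix_bytes = {
    "p0": 500, "p1": 200, "p2": 100, "p3": 50, "p4": 50,
    "p5": 40, "p6": 30, "p7": 20, "p8": 5, "p9": 5,
}

share_top_10pct = traffic_share_of_top(prefix_bytes, 0.10)
share_top_20pct = traffic_share_of_top(prefix_bytes, 0.20)
```

Running the same function on per-PoP counts rather than ISP-wide counts would give the per-PoP variant described above.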

When using prefix popularity for router configuration, 
it would be preferable to be able to calculate the popular 
prefixes over a week, month, or even longer durations. 
The line labeled “Estimate, per-PoP” in the figure shows 
the amount of traffic carried to prefixes that are popular 
on a given day over the period of the next month, 
averaged over each day in the first month of our study. 
As can be seen, the estimate based on prefixes popular 
on any given day carries just a little less traffic as when 
the prefix popularity is calculated daily. This suggests 
that prefix popularity is stable enough for ViAggre 
configuration and the ISP can use the prefixes that are 
popular on a given day for a month or so. However, we 
admit that these results are very preliminary and we 
need to study ISP traffic patterns over a longer period 
to substantiate the claims made above. 

5) Load Analysis: We now consider the impact of 
a ViAggre deployment involving popular prefixes, i.e. 
the ISP populates the FIB on its routers with popu- 
lar prefixes. Specifically, we focus on a deployment 
wherein the aggregation points are assigned to constrain 
Worst Stretch to 4ms, i.e. C = 4ms. Figure 6 shows 
how the traffic impacted and the quartiles for the load 
increase vary with the percentage of popular prefixes 
for both allocations. Note that using popular prefixes 
increases the router FIB size by the number of prefixes 
considered popular and thus, the upper X-axis in the 
figure shows the Worst FIB size. The large fraction 
of traffic carried by popular prefixes implies that both 
the traffic impacted and the load increase drop sharply 
even when a small fraction of prefixes is considered 
popular. For instance, with 2% popular prefixes in case 
of the uniform allocation (figure 6(b)), 7% of the traffic 
follows a different router-level path than before while 
the largest load increase is 3.1% of the original router 
load. With 5% popular prefixes, the largest load increase 
is 1.38%. Note that the more even distribution of 
prefixes across virtual prefixes in the uniform allocation 
results in a more even distribution of the excess traffic 
load across the ISP’s routers; this shows up in the load 
quartiles being much smaller in figure 6(b) as compared 
to the ones in figure 6(a). 


Fig. 7. FIB size for various ISPs using ViAggre. 


C. Rocketfuel Study 


We studied the topologies of 10 ISPs collected as 
part of the Rocketfuel project [37] to determine the 
FIB size savings that ViAggre would yield. Note that 
the fact we don’t have traffic matrices for these ISPs 
implies that we cannot analyze the load increase across 
their routers. For each ISP, we used the assignment 
algorithm to determine the worst FIB size resulting from 
a ViAggre deployment where the worst stretch is limited 
to 5ms. Figure 7 shows that the worst FIB size is always 
less than 15% of the DFZ routing table. However, the 
Rocketfuel topologies are not complete and are missing 
routers. Hence, while the results presented here are 
encouraging, they should be treated as conservative 
estimates of the savings that ViAggre would yield for 
these ISPs. 


Fig. 8. WAIL topology used for our deployment. All routers in the 
figure are Cisco 7300s. RR1 and RR2 are route-reflectors and are not 
on the data path. Routers R1 and R3 aggregate virtual prefix VP1 
while routers R2 and R4 aggregate VP2. 


D. Discussion 


The analysis above shows that ViAggre can signif- 
icantly reduce FIB size. Most of the ISPs we studied 
are large tier-1 and tier-2 ISPs. However, smaller tier- 
2 and tier-3 ISPs are also part of the Internet DFZ. 
Actually, it is probably more important for such ISPs 
to be able to operate without needing to upgrade to the 
latest generation of routers. The fact that these ISPs 
have small PoPs might suggest that ViAggre would 
not be very beneficial. However, given their small size, 
the PoPs of these ISPs are typically geographically 
close to each other. Hence, it is possible to use the 
cumulative FIB space across routers of close-by PoPs 
to shrink the FIB substantially. And the use of popular 
prefixes ensures that the load increase and the traffic 
impact is still small. For instance, we analyzed router 
topology and routing table data from a regional tier-2 
ISP (AS2497) and found that a ViAggre deployment 
with worst stretch less than 5ms can shrink the Worst 
FIB size to 14.2% of the routing table today. 

Further, the fact that such ISPs are not tier-1 ISPs 
implies they are a customer of at least one other ISP. 
Hence, in many cases, the ISP could substantially shrink 
the FIB size on its routers by applying ViAggre to the 
small number of prefixes advertised by their customers 
and peers while using default routes for the rest of the 
prefixes. 


V. Deployment 


To verify the claim that ViAggre is a configuration- 
only solution, we deployed both ViAggre designs on 
a small network built on the WAIL testbed [3]. The 
test network is shown in figure 8 and represents an ISP 
with two PoPs. Each PoP has two Cisco 7301 routers 
and a route-reflector.* For the ViAggre deployment, we 
use two virtual prefixes: 0.0.0.0/1 (VP1) and 128.0.0.0/1 
(VP2) with one router in each PoP serving as an 
aggregation point for each virtual prefix. Routers R1 
and R4 have an external router connected to them and 
exchange routes using an eBGP peering. Specifically, 
router R5 advertises the entire DFZ routing table and 
this is, in turn, advertised through the ISP to router R6. 
We use OSPF for intra-domain routing. Beyond this, 
we configure the internal distribution of BGP routes 
according to the following three approaches: 

1). Status Quo. The routers use a mesh of iBGP 
peerings to exchange the routes and hence, each router 
maintains the entire routing table. 

2). Design-I. The routers still use a mesh of iBGP 
peerings to exchange routes. Beyond this, the routers 
are configured as follows: 

— Virtual Prefixes. Routers advertise the virtual prefix 
they are aggregating to their iBGP peers. 

— FIB Suppression. Each router only loads the routes 
that it is aggregating into its FIB. For instance, router 
R1 uses an access-list to specify that only routes 
belonging to VP1, the virtual prefix VP2 itself and any 
popular prefixes are loaded into the FIB. A snippet of 
this access-list is shown below. 


! R5’s IP address is 198.18.1.200 
distance 255 198.18.1.200 0.0.0.0 1 

! Don’t mark anything inside 0.0.0.0/1 
access-list 1 deny 0.0.0.0 127.255.255.255 
! Don’t mark virtual prefix 128.0.0.0/1 
access-list 1 deny 0.0.0.0 128.0.0.0 

! Don’t mark popular prefix 122.1.1.0/24 
access-list 1 deny 122.1.1.0 0.0.0.255 

! ... other popular prefixes follow ... 

! Mark the rest with admin distance 255 
access-list 1 permit any 

Here, the distance command sets the adminis- 
trative distance of all prefixes that are accepted by 
access-list 1 to “255” and these routes are not 
loaded by the router into its FIB. 

— LSPs to external routers. We use MPLS for the 
tunnels between routers. To this effect, LDP [1] is 
enabled on the interfaces of all routers and establishes 
LSPs between the routers. Further, each edge router (R1 
and R4) initiates a Downstream Unsolicited tunnel [1] 
for each external router connected to them to all their 
IGP neighbors using LDP. This ensures that packets 
to an external router are forwarded using MPLS to 
the edge router which strips the MPLS header before 
forwarding them onwards. 

Given this setup and assuming no popular prefixes, 
routers R1 and R3 store 40.9% of today’s routing table 
(107,943 prefixes that are in VP1) while R2 and R4 
store 59.1%. 

3). Design-II. The routers in a PoP peer with the route- 
reflector of the PoP and the route-reflectors peer with 
each other. External routers R5 and R6 are reconfigured 
to have eBGP peerings with RR1 and RR2 respectively. 
The advertisement of virtual prefixes and the MPLS 
configuration is the same as above. Beyond this, the 
route-reflectors are configured to ensure that they only 
advertise the prefixes being aggregated by a router to it. 
For instance, RR1 uses a prefix-list to ensure that 
only prefixes belonging to VP1, virtual prefix VP2 itself 
and popular prefixes are advertised to router R1. The 
structure of this prefix-list is similar to the access-list 
shown above. Finally, route-reflectors use a route-map 
on their eBGP peerings to change the BGP NEXT_HOP 
of the advertised routes to the edge router that the 
external peer is connected to. This ensures that the 
packets don’t actually flow through the route-reflectors. 


A. Configuration Overhead 


A drawback of ViAggre being a “configuration-only” 
approach is the overhead that the extra configuration 
entails. The discussion above details the extra configu- 
ration that routers need to participate in ViAggre. Based 
on our deployment, the number of extra configuration 
lines needed for a router r to be configured according 
to design-I is given by $(r_{int} + r_{ext} + 2|A(r)| + |P_r| + 6)$ 
where $r_{int}$ is the number of router interfaces, $r_{ext}$ is the 
number of external routers r is peering with, $|A(r)|$ is 
the number of virtual prefixes r is aggregating and $|P_r|$ 
is the number of popular prefixes in r. Given the size of 
the routing table today, considering even a small fraction 
of prefixes as popular would cause the expression to be 
dominated by $|P_r|$ and can represent a large number of 
configuration lines. 
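
To make the expression concrete, the following minimal sketch evaluates it for a hypothetical router. The function name and every input value are illustrative assumptions, not figures from the deployment.

```python
def extra_config_lines(r_int, r_ext, num_aggregated_vps, num_popular_prefixes):
    """Extra design-I configuration lines: r_int + r_ext + 2|A(r)| + |P_r| + 6."""
    return r_int + r_ext + 2 * num_aggregated_vps + num_popular_prefixes + 6


# Hypothetical router: 4 interfaces, 1 external peer, 1 aggregated virtual prefix.
base = extra_config_lines(4, 1, 1, 0)              # no popular prefixes
with_popular = extra_config_lines(4, 1, 1, 5000)   # 5,000 popular prefixes (assumed)
print(base, with_popular)
```

Even with these small assumed values, the popular-prefix term accounts for almost all of the total once any sizeable number of prefixes is treated as popular.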

However, quantifying the extra configuration lines 
does not paint the complete picture since given a list 
of popular prefixes, it is trivial to generate an access or 
prefix-list that would allow them. To illustrate this, we 
developed a configuration tool as part of our deployment 
effort. The tool is a 334-line Python script which takes as 
input a router’s existing configuration file, the list of 
virtual prefixes, the router’s (or representative) Netflow 
records and the percentage of prefixes to be considered 
popular. The tool extracts relevant information, such as 
information about the router’s interfaces and peerings, 
from the configuration file. It also uses the Netflow 
records to determine the list of prefixes to be considered 
popular. Based on these extracted details, the script 
generates a configuration file that allows the router to 
operate as a ViAggre router. We have been using this 
tool for experiments with our deployment. Further, we 
use clogin [41] to automatically load the generated 
ViAggre configuration file onto the router. Thus, we 
can reconfigure our testbed from status quo operation 
to ViAggre operation (design-I and design-II) in an 
automated fashion. While our tool is specific to the 
router vendor and other technologies in our deployment, 
its simplicity and our experience with it lend evidence 
to the argument that ViAggre offers a good trade-off 
between the configuration overhead and increased 
routing scalability. 
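
The core step such a tool performs can be sketched as follows. This is not the deployed script: the function names, the pre-digested (prefix, bytes) form of the Netflow records, and the IOS-style `ip prefix-list` output are all assumptions made for illustration, and real syntax is vendor-specific.

```python
from collections import Counter


def popular_prefixes(flow_records, all_prefixes, percent):
    """Return the top `percent`% of the routing table, ranked by traffic volume.

    flow_records: iterable of (prefix, byte_count) pairs (assumed input form).
    """
    volume = Counter()
    for prefix, nbytes in flow_records:
        volume[prefix] += nbytes
    count = int(len(all_prefixes) * percent / 100)
    # Highest volume first; ties broken by prefix string for determinism.
    ranked = sorted(volume.items(), key=lambda kv: (-kv[1], kv[0]))
    return [prefix for prefix, _ in ranked[:count]]


def prefix_list_lines(name, prefixes):
    """Emit one IOS-style permit line per prefix."""
    return [
        f"ip prefix-list {name} seq {5 * (i + 1)} permit {p}"
        for i, p in enumerate(prefixes)
    ]


# Hypothetical traffic records and a four-entry routing table.
flows = [("10.1.0.0/16", 900), ("10.2.0.0/16", 50), ("10.1.0.0/16", 300),
         ("10.3.0.0/16", 700), ("10.4.0.0/16", 10)]
table = ["10.1.0.0/16", "10.2.0.0/16", "10.3.0.0/16", "10.4.0.0/16"]
popular = popular_prefixes(flows, table, percent=50)
config = prefix_list_lines("POPULAR", popular)
print("\n".join(config))
```

Because the list is derived mechanically from traffic records, its length adds little operator effort even when it dominates the line count.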





B. Control-plane Overhead 


Section IV evaluated the impact of ViAggre on the 
ISP’s data plane. Beyond this, ViAggre uses control-plane 
mechanisms to divide the routing table amongst 
the ISP’s routers: Design-I uses access-lists and 
Design-II uses prefix-lists. We quantify the performance 
overhead imposed by these mechanisms using 
our deployment. Specifically, we look at the impact of 
our designs on the propagation of routes through the 
ISP. 

To this effect, we configured the internal distribu- 
tion of BGP routes in our testbed according to the 
three approaches described above. External router R5 
is configured to advertise a variable number of prefixes 
through its eBGP peering. We restart this peering on 
router R5 and measure the time it takes for the routes 
to be installed into the FIB of the ISP’s routers and 
then advertised onwards; hereon we refer to this as the 
installation time. During this time, we also measure 
the CPU utilization on the routers. We achieve this by 
using a clogin script to execute the “show process cpu” 
command on each router every 5 seconds. The com- 
mand gives the average CPU utilization of individual 
processes on the router over the past 5 seconds and we 
extract the CPU utilization of the “BGP router” process. 

We measured the installation time and the CPU 
utilization for the three approaches. For status quo and 
design-I, we focus on the measurements for router R1 
while for design-II, we focus on the measurements 
for route-reflector RR1. We also varied the number 
of popular prefixes. Here we present results with 2% 
and 5% popular prefixes. Figures 9 and 10 plot the 
installation time and the quartiles for the CPU utilization 
respectively. 
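
As a minimal sketch of how per-run CPU samples can be reduced to quartiles, consider the following. The sample values are hypothetical, and the choice of the "inclusive" quantile method is an assumption; the text does not say how the quartiles were computed.

```python
import statistics


def cpu_quartiles(samples):
    """Return (Q1, median, Q3) of CPU utilization samples, in percent."""
    q1, q2, q3 = statistics.quantiles(samples, n=4, method="inclusive")
    return q1, q2, q3


# Hypothetical 5-second samples of the "BGP router" process during one transfer.
samples = [10, 20, 30, 40, 50]
quartiles = cpu_quartiles(samples)
print(quartiles)
```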

Design-I Vs Status Quo. Figure 9 shows that the 
installation time with design-I is much higher than 
that with status quo. For instance, with status quo, 
the complete routing table is transferred and installed 
on router R1 in 273 seconds while with design-I and 
2% popular prefixes, it takes 487 seconds. Further, 
the design-I installation time increases significantly 
as the number of popular prefixes increases. Finally, 
figures 10(b) and 10(c) show that design-I leads to a 
very high CPU load during the transfer which increases 
as more prefixes are considered popular. This results 
from the fact that access-lists with a large number 
of rules are very inefficient. Such a load would obviously be 
unacceptable for an ISP deploying ViAggre. We are 
currently exploring ways to achieve FIB suppression 
without the use of access-lists. 

Design-II Vs Status Quo. Figure 9 shows that the time 
to transfer, install and propagate routes with design-II 
is lower than with status quo. For instance, design-II with 
2% popular prefixes leads to an installation time of 
124 seconds for the entire routing table as compared 
to 273 seconds for status quo. Further, the installation 
time does not change much as the fraction of popular 
prefixes increases. Figures 10(d) and 10(e) show that the 
CPU utilization is low with median utilization being less 
than 20%. Note that the utilization shown for design-II 
was measured on route-reflector RR1 which has fewer 
peerings than router R1 in status quo. This explains the 
fact that the utilization with design-II is less than status 
quo. While preliminary, this experiment suggests that 
design-II can also help with route convergence within 
the ISP. 


C. Failover 


As detailed in section II-E, as long as alternate 
aggregation points exist, traffic in a ViAggre network is 
automatically re-routed upon failure of the aggregation 
point being used. We measured this failover time using 
our testbed. In the interest of space, we very briefly 
summarise the experiment here. We generated UDP 
traffic between PCs connected to routers R5 and R6 
(figure 8) and then crashed the router being used as the 
aggregation point for the traffic. We measured the time 
it takes for traffic to be re-routed over 10 runs with each 
design. In both cases, the maximum observed failover 
time was 200 usecs. This shows that our designs ensure 
fast failover between aggregation points. 


VI. Discussion 


Pros. ViAggre can be incrementally deployed by an ISP 
since it does not require the cooperation of other ISPs 
and router vendors. The ISP does not need to change the 
structure of its PoPs or its topology. What’s more, an 
ISP could experiment with ViAggre on a limited scale 
(a few virtual prefixes or a limited number of PoPs) 
to gain experience and comfort before expanding its 
deployment. None of the attributes in the BGP routes 
advertised by the ISP to its neighbors are changed due 
to the adoption of ViAggre. Also, the use of ViAggre by 
the ISP does not restrict its routing policies and route 
selection. Further, at least for design-II, control-plane 
processing is reduced. Finally, there is incentive for 
deployment since the ISP improves its own capability 
to deal with routing table growth. 

Management Overhead. As detailed in section V- 
A, ViAggre requires extra configuration on the ISP’s 
routers. Beyond this, the ISP needs to make a number 
of deployment decisions such as choosing the virtual 
prefixes to use, deciding where to keep aggregation 
points for each virtual prefix, and so on. Apart from 
such one-time or infrequent decisions, ViAggre may 
also influence very important aspects of the ISP’s day- 
to-day operation such as maintenance, debugging, etc. 
All this leads to increased complexity and there is a cost 
associated with the extra management. 

In section V-A we discussed a configuration tool 
that automates ViAggre configuration. It is difficult to 
speculate about actual costs and so we don’t compare 
the increase in management costs against the cost of 
upgrading routers. While we hope that our tools will 
actually lead to cost savings for a ViAggre network, an 
ISP might just be inclined to adopt ViAggre because 
it breaks the dependency of various aspects of its 
operation on the size of the routing table. These aspects 
include its upgrade cycle, the per-byte forwarding cost, 
the per-byte forwarding power, etc. 

Popular Prefixes. As mentioned earlier, ViAggre rep- 
resents a trade-off between FIB shrinkage on one hand 
and increased router load and traffic stretch on the 
other. The fact that Internet traffic follows a power-law 
distribution makes this a very beneficial trade-off. 
This power-law observation has held up in measurement 
studies from 1999 [10] to 2008 (in this paper) and 
hence, Internet traffic has followed this distribution for 
at least the past nine years in spite of the rise in 
popularity of P2P and video streaming. We believe 
that, more likely than not, future Internet traffic will be 
power-law distributed and hence, ViAggre will represent 
a good trade-off for ISPs. 

Other design points. The ViAggre proposal presented 
in this paper represents one point in the design space 
that we focussed on for the sake of concreteness. 
Alternative approaches based on the same idea include: 
— Adding routers. We have presented a couple of tech- 
niques that ensure that only a subset of the routing 
table is loaded into the FIB. Given this, an ISP could 
install “slow-fat routers”, low-end devices (or maybe 
even a stack of software routers [16]) in each PoP 
that are only responsible for routing traffic destined 
to unpopular prefixes. These devices forward a low- 
volume of traffic, so it would be easier and cheaper to 
hold the entire routing table. The popular prefixes are 
loaded into existing routers. This approach can be seen 
as a variant of route caching and does away with a lot 
of deployment complexity. In fact, ViAggre may allow 
us to revisit route caching [24]. 

— Router changes. Routers can be changed to be 
ViAggre-aware and hence, make virtual prefixes first- 
class network objects. This would do away with a lot 
of the configuration complexity that ViAggre entails, 
ensure that ISPs get vendor support and hence, make 
it more palatable for ISPs. We, in cooperation with a 
router vendor, are exploring this option [15]. 

— Clean-slate ViAggre. The basic concept of virtual 
networks can be applied in an inter-domain fashion. 
The idea here is to use cooperation amongst ISPs to 
induce a routing hierarchy that is more aggregatable and 
hence, can accrue benefits beyond shrinking the router 
FIB. This involves virtual networks for individual virtual 
prefixes spanning domains such that even the RIB on 
a router only contains the prefixes it is responsible for. 
This would reduce both the router FIB and RIB and in 
general, improve routing scalability. We intend to study 
the merits and demerits of such an approach in future 
work. 


VII. Related Work 


A number of efforts have tried to directly tackle 
the routing scalability problem through clean-slate designs. 
One set of approaches tries to reduce routing 
table size by dividing edge networks and ISPs into 
separate address spaces [7,11,29,32,40]. Our work resembles 
some aspects of CRIO [40] which uses virtual 
prefixes and tunneling to decouple network topology 
from addressing. However, CRIO requires adoption by 
all provider networks and like [7,11,29,32], requires a 
new mapping service to determine tunnel endpoints. 
APT [22] presents such a mapping service. Alterna- 
tively, it is possible to encode location information into 
IP addresses [8,14,18] and hence, reduce routing table 
size. Finally, an interesting set of approaches that trade- 
off stretch for routing table size are Compact Routing 
algorithms; see [26] for a survey of the area. 

The use of tunnels has long been proposed as a 
routing scaling mechanism. VPN technologies such as 
BGP-MPLS VPNs [9] use tunnels to ensure that only 
PE routers need to keep the VPN routes. As a matter of 
fact, ISPs can and probably do use tunneling protocols 
such as MPLS and RSVP-TE to engineer a BGP-free 
core [35]. However, edge routers still need to keep the 
full FIB. With ViAggre, none of the routers on the data- 
path need to maintain the full FIB. Router vendors, 
if willing, can use a number of techniques to reduce 
the FIB size, including FIB compression [35] and route 
caching [35]. Forgetful routing [23] selectively discards 
alternative routes to reduce RIB size. [2] sketches the 
basic ViAggre idea. In recent work, Kim et al. [25] use 
relaying, similar to ViAggre’s use of aggregation points, 
to address the VPN routing scalability problem. 

Over the years, several articles have documented the 
existing state of inter-domain routing and delineated 
requirements for the future [5,12,28]; see [12] for other 
routing related proposals. RCP [4] and 4D [17] argue 
for logical centralization of routing in ISPs to provide 
scalable internal route distribution and a simplified 
control plane respectively. We note that ViAggre fits 
well into these alternative routing models. As a matter 
of fact, the use of route-reflectors in design-II is similar 
in spirit to RCSs in [4] and DEs in [17]. 


VIII. Summary 


This paper presents ViAggre, a technique that can be 
used by an ISP to substantially shrink the FIB on its 
routers and hence, extend the lifetime of its installed 
router base. The ISP may have to upgrade the routers 
for other reasons but at least it is not driven by DFZ 
growth over which it has no control. While it remains to 
be seen whether the use of automated tools to configure 
and manage large ViAggre deployments can offset the 
complexity concerns, we believe that the simplicity 
of the proposal and its possible short-term impact on 
routing scalability suggest that it is an alternative worth 
considering. 


NOTES 


1 Hereon, we follow the terminology used in [39] and use the 
term “routing table” to refer to the Forwarding Information Base or 
FIB, commonly also known as the forwarding table. The Routing 
Information Base is explicitly referred to as the RIB. 

2 All other attributes for the routes to a virtual prefix are the same 
and hence, the decision is based on the IGP metric to the aggregation 
points. Hence, “closest” means closest in terms of IGP metric. 

3 With design-II, a router chooses the aggregation point closest to 
the router’s route-reflector in terms of IGP metrics and so a similar 
formulation works for the second design too. 

4 These are used only for the design-II deployment. We used both 
a Cisco 7301 and a Linux PC as a route-reflector. 


Abstract 


We propose to construct routing overlay networks us- 
ing the following principle: that overlay edges should be 
based on mutual advantage between pairs of hosts. Upon 
this principle, we design, implement, and evaluate Peer- 
Wise, a latency-reducing overlay network. To show the 
feasibility of PeerWise, we must show first that mutual 
advantage exists in the Internet: perhaps contrary to ex- 
pectation, that there are not only “haves” and “have nots” 
of low-latency connectivity. Second, we must provide a 
scalable means of finding promising edges and overlay 
routes; we seek embedding error in network coordinates 
to expose both shorter-than-default “detour” routes and 
longer-than-expected default routes. 

We evaluate the cost of limiting PeerWise to mutu- 
ally advantageous links, then build the intelligent com- 
ponents that put PeerWise into practice. We design and 
evaluate “virtual” network coordinates for destinations 
not participating in the overlay, neighbor selection algo- 
rithms to find promising relays, and relay selection algo- 
rithms to choose the neighbor to traverse for a good de- 
tour. Finally, we show that PeerWise is practical through 
a wide-area deployment and evaluation. 


1 Introduction 


We propose mutual advantage as a principle for the con- 
struction of routing overlay networks: overlay edges 
should exist only between hosts that benefit from each 
other’s resources or position in the network. Hosts nego- 
tiate connections based strictly on mutual advantage, and 
overlay paths follow only these connections. 

Several distributed protocols and applications use mu- 
tual advantage as part of their design. BitTorrent [5] 
peers that download the same file trade blocks the other 
is missing. In backup systems [7], nodes store replicas 
of files for each other. Autonomous systems in the In- 
ternet negotiate peer-to-peer agreements to provide low- 
cost connectivity to each other’s customers [9]. 

Bringing mutual advantage into the design of routing 
overlays has several benefits. First, mutual advantage 
induces better cooperation among nodes. Incentives to 
participate become simpler, and long-lived, fair connec- 
tions appear. Building systems grounded in incentives 
for cooperation makes them robust to misbehavior and 
selfishness [23, 29]. Second, users could freely discriminate 
among the connections that they allow and would 
have the ability to explicitly say how much service they 
want to contribute. Finally, mutual advantage avoids the 
tragedy of the commons in routing overlays, when only 
a few, well-connected nodes provide transit. It keeps 
the trades of connectivity fair, in contrast to file-sharing 
where universities are net providers of content [27]. 

In this paper, we present the design, implementation, 
and evaluation of PeerWise, a latency-reducing routing 
overlay based on mutual advantage. PeerWise scalably 
discovers detour routes: “indirect” one-hop paths that 
have lower round trip latency than the “direct” path. 

In a previous paper [17], we presented ideas that sup- 
port a mutually advantageous latency-reducing overlay: 
that mutual advantage is common in the context of Inter- 
net latencies and that embedding error in network coor- 
dinate systems, such as Vivaldi [8] or GNP [20], could be 
used to scalably discover detours. However, we did not 
evaluate the potential and limitations of mutual advan- 
tage, nor did we design or implement a system to take 
advantage of the existing detour routes. In this work, 
we show that a mutually advantageous latency-reducing 
overlay is feasible and efficient, and that detours toward 
popular destinations are available. We design, imple- 
ment, and evaluate a system that finds these detours. 

We describe our contributions next. 

First, we use a measurement-driven simulation to 
show the potential of PeerWise (§4). We collect two 
latency data sets to find what fraction of detours exist 
subject to the mutual advantage requirement and, inde- 
pendently, can be found by embedding error. The mu- 
tual advantage requirement reduces the number of desti- 
nations reachable via detour by approximately half, yet 
even popular websites, using content distribution ser- 
vices such as Akamai, are reachable by PeerWise-found 
detours. Only 5% of potential detours are missed by em- 
bedding error. 

We next describe the design of PeerWise in two main 
parts: mechanism (§5) and policies (§6). We implement a 
virtual network coordinate approach to find coordinates 
for the destinations that do not participate directly in the 
overlay. Neighbor tracking determines the set of nodes 
that are more likely to offer detours by remembering 
those neighbors with high embedding error in the coordinate 
space. Pairwise negotiation establishes connections 
promising mutual benefit while the maintenance component 
ensures that each node receives approximately as 
much benefit as it provides. 

The second part of the design focuses on the decisions 
that each PeerWise node makes. We evaluate neighbor 
selection and relay selection algorithms. We show that 
coordinates can be used to choose among detours. Our 
environment is quite different from previous work on 
latency prediction using coordinates. Instead of focus- 
ing on source-to-destination, we must choose a source- 
to-relay-to-destination path based on a relay coordinate 
known to have high embedding error and a destination 
coordinate that may be stale or inaccurate. 

Finally, we describe the implementation of PeerWise 
and its evaluation on PlanetLab (§7). We show that 
PeerWise nodes find detours to popular destinations, that 
these detours are stable, and that they offer significant la- 
tency reductions. Most detours last for a long time and 
are obtained using only one mutually advantageous peer- 
ing. We then show how PeerWise detours translate into 
real life and whether user applications can benefit. 


2 Related Work 


Routing overlays, such as RON [2], Detour [28], 
SOSR [11], and OverQoS [31], promise to provide more- 
reliable or faster paths through the Internet. They for- 
ward packets along links in self-constructed meshes and 
make routing decisions without support from routers or 
ASes. RON [2] builds a fully connected mesh and mon- 
itors all edges. When the direct path between two nodes 
fails or has performance problems, communication is 
established through the other overlay nodes. Nakao et 
al. [19, 18] use static AS-level topology and geographi- 
cal distance information to eliminate redundant overlay 
edges and improve scalability. Gummadi et al. show that 
all-to-all measurements are not necessary to find reliable 
paths: routing through a randomly chosen intermediary 
node is enough [11]. Similarly, we show that faster-than- 
default paths can be discovered with limited information: 
network coordinates and latencies to a few other nodes 
are sufficient. 

Various file-swarming systems [5, 15, 30] apply tit- 
for-tat-like schemes to induce cooperation among peers. 
Tit-for-tat applies when there is a mutual interest among 
peers, which is common in file swarming; for any pair 
of peers, one may have blocks the other does not. We 
show that, perhaps surprisingly, mutual interest is com- 
mon in low-latency routing in the Internet as well, and 
that locating nodes of mutual interest can be done in a 
decentralized fashion. 

The requirements imposed by PeerWise on who can 
connect to whom are reminiscent of the bilateral connection 
game (BCG) [6], a special case of network formation 
game. In BCG, a link between two nodes is established 
only with the consent of both nodes, similarly to 
PeerWise. However, nodes construct links that minimize 
the cost of reaching other participating nodes, whereas 
in PeerWise, nodes create peerings that offer detours to 
destinations that do not necessarily participate. 

Figure 1: Obtaining faster paths with PeerWise: A discovers 
a detour to D through B; B also finds that it can 
reach C faster if it traverses A; A and B create a mutually 
advantageous peering that they both use to reach 
their destinations more quickly. 


3 PeerWise Philosophy 


In this section, we present an overview of PeerWise. We 
outline the two properties on which PeerWise is based: 
that mutual advantage is common in the Internet latency 
space and that network coordinate systems can help indi- 
cate detour routes. A previous paper [17] describes these 
properties in more detail. We then argue that it has the 
potential to be applied to a wide range of applications. 


3.1 Overview 

The key idea of PeerWise is that two nodes can cooper- 
ate to obtain faster end-to-end paths without either be- 
ing compelled to offer more service than they receive. 
Peers negotiate and establish pairwise connections to 
each other based strictly on mutual advantage. Figure 1 
shows an example. The default Internet path between 
two nodes is the direct path. A shorter, alternate path 
having one intermediate hop is a detour, using terminol- 
ogy from Detour [28]. Node A discovers a faster path to 
D via B. However, B will not help A unless A provides 
a detour in exchange. Since there is a shorter path from 
B to C going through A, A and B can help each other 
communicate faster with their intended destinations. 
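
The A/B example above can be sketched as a simple check. The RTT values and function names below are hypothetical and chosen only to mirror the figure; this is an illustration of the idea, not the PeerWise implementation.

```python
def detour_via(rtt, src, relay, dst):
    """True if src -> relay -> dst is faster than the direct src -> dst path."""
    return rtt[src][relay] + rtt[relay][dst] < rtt[src][dst]


def mutually_advantageous(rtt, a, b, dests_a, dests_b):
    """True if b relays a to some destination and a relays b to some destination."""
    a_gains = any(detour_via(rtt, a, b, d) for d in dests_a)
    b_gains = any(detour_via(rtt, b, a, d) for d in dests_b)
    return a_gains and b_gains


# Hypothetical RTTs in milliseconds, as seen from A and from B.
rtt = {
    "A": {"B": 20, "C": 40, "D": 100},
    "B": {"A": 20, "C": 90, "D": 30},
}
peering = mutually_advantageous(rtt, "A", "B", dests_a=["D"], dests_b=["C"])
one_sided = mutually_advantageous(rtt, "A", "B", dests_a=["D"], dests_b=["D"])
print(peering, one_sided)
```

In the second call B gains nothing from A, so no peering is formed even though A would benefit.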


Mutual Advantage. Each participant in overlay networks 
contributes resources in exchange for the resources 
of others. Unfortunately, free access and unrestricted 
demand may lead to over-utilization of certain resources, 
especially those of well-provisioned nodes. This 
tragedy of the commons occurs because the benefits of 
using common resources accrue to individuals, while the 
costs of exploitation are shared by the resource providers. 

Pairwise peerings based on mutual benefit offer users 
an effective way to resolve the tragedy of the commons, 
as they can freely discriminate among the connections 
they allow. However, such decentralized policy may be 
costly: if nodes accept only peerings that are mutually 
advantageous, but mutual advantage is rare, the benefit 
of the overlay is lost. In Section 4, we show that mutual 
benefit is common and that a majority of nodes are in a 
position to both provide and receive service. 

Figure 2: Embedding three points that form a TIV into 
a metric space introduces inaccuracies. The numbers in 
parentheses represent embedding errors. 


Network Coordinates Embedding Error. Measuring 
and distributing the all-pairs latencies required to find 
detours would limit the scalability of a latency-reducing 
overlay. Instead, PeerWise detects triangle inequality vi- 
olations (TIVs) and uses them to predict good detours. 

Three nodes in the Internet form a TIV when the RTT 
between two of them (the long side of the TIV) is greater 
than the sum of the RTTs to the third node (the short 
sides of the TIV). The left side of Figure 2 shows an ex- 
ample TIV. Pairs of nodes that are long sides in TIVs 
may benefit from detours; pairs that are short sides may 
be part of detours. 

To find TIVs scalably, PeerWise uses network coor- 
dinates. A network coordinate system associates nodes 
with points in a metric space such that the distance 
between the points estimates the real latency between 
nodes. Since TIVs are not allowed in metric spaces by 
definition, this embedding may result in high errors on 
the edges of the triangle (see Figure 2). The error for the 
long side of the TIV will be very negative, or the error 
for the sum of short sides of the TIV will be very pos- 
itive. Thus, a pair of nodes with a negative estimation 
error has a higher chance of benefiting from a shorter 
path; conversely, when the nodes have a large positive estimation 
error between them, they are more likely to be part of a 
shorter path for another node. 
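
The relationship between TIVs and embedding error can be illustrated with a small sketch. The RTTs and coordinates are hypothetical (the coordinates are chosen by hand, not produced by Vivaldi), and we assume the error is defined as estimated minus measured latency, which matches the signs described above.

```python
import math


def is_tiv(long_rtt, short_rtt_1, short_rtt_2):
    """True if the three RTTs violate the triangle inequality."""
    return long_rtt > short_rtt_1 + short_rtt_2


def embedding_error(coord_a, coord_b, measured_rtt):
    """Estimated minus measured RTT; negative means the embedding underestimates."""
    return math.dist(coord_a, coord_b) - measured_rtt


# Hypothetical measured RTTs (ms): A-B is the long side, C is the third node.
rtt_ab, rtt_ac, rtt_cb = 100.0, 20.0, 30.0
# Hypothetical 2-D coordinates for A, B and C.
a, b, c = (0.0, 0.0), (60.0, 0.0), (25.0, 0.0)

violation = is_tiv(rtt_ab, rtt_ac, rtt_cb)
err_long = embedding_error(a, b, rtt_ab)  # long side is underestimated
err_short = embedding_error(a, c, rtt_ac) + embedding_error(c, b, rtt_cb)
print(violation, err_long, err_short)
```

Here the long side A-B gets a negative error, marking it as a candidate to benefit from a detour, while the short sides get positive error, marking C as a candidate relay.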


3.2 Where does PeerWise apply? 


We expect PeerWise has the most utility for latency-sensitive 
traffic such as HTTP HEAD requests that check 
for updates to a cached file before rendering, XML- 
RPC requests for rapid updates of existing content such 
as train status or sports scores, voice traffic relayed to 
bypass firewalls, and online games such as first-person 
shooters, whose playability hinges on low-latency up- 
dates among players [3]. Existing overlay networks 
could benefit from using PeerWise as a latency-reducing 
substrate by guiding PeerWise’s neighbor- and relay- 
selection algorithms to better suit the application’s needs. 


Because PeerWise focuses on reducing latency, it can 
find and use the low-latency paths that may not sup- 
port high-bandwidth use—that is, the low-latency paths 
that the default routing, likely tuned for high-bandwidth, 
misses. Going through a peer is likely to traverse an- 
other access link that might have low bandwidth. This 
means that bandwidth-intensive applications, such as 
video streaming, are unlikely to benefit from latency re- 
duction with PeerWise. 


4 Limitations of Mutual Advantage 


We assess the potential performance of a mutually advantageous 
latency-reducing overlay. Because we restrict 
detour paths to mutually advantageous peerings, we 
would not expect PeerWise to find the shortest detours or 
find detours to all destinations. We simulate using two la- 
tency data sets to show that nodes can find shorter paths 
to the majority of destinations for which a shorter detour 
exists, despite the requirement of mutual advantage. We 
find that mutually advantageous detours exist even for 
popular destinations hosted on many prefixes. 


4.1 Collected Data Sets 


We collected two real-world latency data sets and computed 
all one-hop detours between each pair of nodes. 

PW-King Data Set. The first data set, PW-King, contains 
RTTs between 1,953 DNS servers of hosts in the 
Gnutella network. The list of hosts was gathered by 
Dabek et al. for the Vivaldi [8] project. We use King [10] 
to measure all-to-all latencies between the servers. King 
uses recursive DNS queries to estimate the propagation 
delay between two hosts as the delay between their au- 
thoritative name servers. The 1,953 servers were chosen 
for being in the same subnet as their hosts so that better- 
connected DNS servers would not influence the estimates 
of inter-client latencies. For each pair of nodes, we kept 
the median of all latencies measured at random intervals 
for a week in February 2008. Of the 1,953 servers, we 
removed 238 that appeared to experience high load dur- 
ing the measurement, as described by Dabek et al. [8]. 
A heavily-loaded DNS server can cause King to under- 
estimate latencies to other nodes, which can lead to false 
triangle inequality violations. 

Popular Destinations Data Set. The second data set, 
PL-Dest, contains RTTs from 389 PlanetLab nodes to 
500 popular web servers, measured in January 2008. We 
selected the servers based on a ranking by the Alexa In- 
ternet Company [1] using expected and measured client 
access. For faster content delivery, many of the web- 
sites have multiple IP addresses; users in different geo- 
graphic regions see different IPs for the same server. To 
gather the IP addresses associated with a website, as visi- 
ble from PlanetLab, we performed DNS lookups on each 
of the 500 names from the 389 PlanetLab nodes. We obtained 
2932 distinct IP addresses in 796 /24 prefixes. We 
probed each prefix and each PlanetLab node from every 
PlanetLab node at random times over a week. We used 
the median RTT values to represent the link. 

Figure 3: Distribution of the fraction of nodes with which a potential 
peering exists. In PL-Dest, 18% of nodes have no potential peerings; in 
PW-King, 50% of nodes have potential peerings with at least 75% of the 
other nodes. 

The latency collection process can produce incorrect 
data that may bias our results. We removed 52 servers 
from the final data set because we could not measure any 
RTT to them. Further, several PlanetLab nodes had very 
low latencies (< 1 ms) to most destinations. These laten- 
cies are likely caused by connection-tracking firewalls or 
“transparent” proxies near the PlanetLab nodes that gen- 
erate spoofed responses as if from the destination. We 
removed those nodes from the data set since they would 
artificially overstate the potential of PeerWise. Our final 
latency matrix contains RTT values from 325 PlanetLab 
nodes to 718 prefixes corresponding to 448 websites. 

The PW-King and PL-Dest data sets illustrate two sce- 
narios in which PeerWise can be useful. Latency reduc- 
tion on PW-King shows the potential benefit to applica- 
tions a set of peers may run, such as a distributed multi- 
player network game or VoIP application. On PL-Dest, 
reduced latency shows benefit for users accessing popu- 
lar servers that would not participate in PeerWise. 


4.2 Methodology 


We built a simulation prototype of PeerWise to study 
how well it finds detours with mutual advantage and em- 
bedding error. To find network coordinates for nodes, 
we use Vivaldi [8]. We allow each node to communicate 
with all other nodes, to better study mutual advantage in 
isolation. When requesting detours for its destinations, a 
node starts with the neighbor that has the highest embed- 
ding error [17]. We evaluate alternative relay selection 
methods in Section 6.2. 

For each pair of nodes in our data sets, we find all one- 
hop detours. We define a good detour as a detour that 
provides at least 10 ms and 10% latency reduction over 
Figure 4: Distribution of the fraction of destinations reachable through mutually
advantageous peerings for the PL-Dest data set (left) and the PW-King data set
(right). In PL-Dest, few destinations can be reached by detour at all, some
sources need no detours, and approximately half of the detours that could
be used are lost by the mutual advantage restriction. In PW-King, all nodes
have many detours available, and mutual advantage is less costly. In both,
embedding error finds nearly all detours.


the direct path. We consider only good detours. This 
cutoff helps avoid impractical or dubious detours due to 
measurement error. In the PL-Dest data set, we may find 
detours by server name: The detour path may end at a 
different IP address associated with the same name. 
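
As an illustration only, the good-detour rule can be written as a short Python sketch. The names is_good_detour and one_hop_detours, and the dictionary rtt of measured round trip times in milliseconds, are hypothetical and do not come from the PeerWise code; the example latencies are made up.

```python
MIN_ABS_GAIN_MS = 10.0   # at least 10 ms faster than the direct path
MIN_REL_GAIN = 0.10      # and at least 10% faster than the direct path

def is_good_detour(direct_ms, detour_ms):
    """Return True if the detour meets both thresholds from the text."""
    gain = direct_ms - detour_ms
    return gain >= MIN_ABS_GAIN_MS and gain >= MIN_REL_GAIN * direct_ms

def one_hop_detours(rtt, src, dst, relays):
    """List (relay, detour_ms) for every good one-hop detour from src to dst.

    rtt is a hypothetical dict mapping (a, b) -> measured RTT in ms.
    """
    direct = rtt[(src, dst)]
    found = []
    for r in relays:
        if r in (src, dst) or (src, r) not in rtt or (r, dst) not in rtt:
            continue
        via = rtt[(src, r)] + rtt[(r, dst)]
        if is_good_detour(direct, via):
            found.append((r, via))
    # Fastest detour first.
    return sorted(found, key=lambda item: item[1])

# Made-up latencies: relay B saves 30 ms, relay C saves only 5 ms.
rtt = {("A", "D"): 120.0, ("A", "B"): 30.0, ("B", "D"): 60.0,
       ("A", "C"): 50.0, ("C", "D"): 65.0}
print(one_hop_detours(rtt, "A", "D", ["B", "C"]))
```

In this example only the path through B qualifies; the path through C is faster than the direct path but falls below the 10 ms cutoff.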


4.3 Mutual Advantage 


How much mutual advantage exists in our data sets? We 
define a potential peering to exist between two nodes that 
can provide a detour to each other, for at least one des- 
tination, as between A and B in Figure 1. The number 
of potential peerings for a node represents the number 
of neighbors with which the node can construct mutu- 
ally advantageous peerings. In Figure 3, we show a cu- 
mulative distribution of the fraction of nodes for which 
a potential peering exists. Each point represents a node,
and its placement on the x-axis indicates what fraction of the
other nodes it shares a potential peering with. At least 50% of
the nodes in either data set have potential peerings
with at least 50% of the rest of the nodes. The figure
also shows that there is more mutual advantage in the 
PW-King data set than in PL-Dest. 
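
The definition of a potential peering can be sketched as follows. This is a minimal Python illustration under stated assumptions: latencies are symmetric, a detour counts only if it is good in the sense of Section 4.2, and all names (symmetric, has_good_detour, potential_peering, peering_fraction) and latencies are hypothetical.

```python
def symmetric(pairs):
    """Build a symmetric RTT table from one-directional entries."""
    rtt = {}
    for (a, b), ms in pairs.items():
        rtt[(a, b)] = ms
        rtt[(b, a)] = ms
    return rtt

def has_good_detour(rtt, src, relay, dst):
    """True if src reaches dst at least 10 ms and 10% faster via relay."""
    legs = [(src, dst), (src, relay), (relay, dst)]
    if any(leg not in rtt for leg in legs):
        return False
    direct = rtt[(src, dst)]
    gain = direct - (rtt[(src, relay)] + rtt[(relay, dst)])
    return gain >= 10.0 and gain >= 0.10 * direct

def potential_peering(rtt, a, b, destinations):
    """A potential peering exists when each node can relay for the other."""
    dests = [d for d in destinations if d not in (a, b)]
    b_helps_a = any(has_good_detour(rtt, a, b, d) for d in dests)
    a_helps_b = any(has_good_detour(rtt, b, a, d) for d in dests)
    return b_helps_a and a_helps_b

def peering_fraction(rtt, node, nodes, destinations):
    """Fraction of the other nodes with which node has a potential peering."""
    others = [n for n in nodes if n != node]
    if not others:
        return 0.0
    hits = sum(potential_peering(rtt, node, n, destinations) for n in others)
    return hits / len(others)

# Made-up latencies: A reaches D faster via B, and B reaches C faster via A.
rtt = symmetric({("A", "B"): 20.0, ("A", "D"): 150.0, ("B", "D"): 70.0,
                 ("B", "C"): 140.0, ("A", "C"): 60.0,
                 ("A", "E"): 100.0, ("B", "E"): 100.0})
print(potential_peering(rtt, "A", "B", ["C", "D"]))
print(peering_fraction(rtt, "A", ["A", "B", "E"], ["C", "D"]))
```

The value returned by peering_fraction for a node corresponds to that node's position on the x-axis of Figure 3.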


Next, we show that mutual advantage sacrifices few 
detours. We study the fraction of destinations that each 
node can reach more quickly via mutually advantageous 
peerings in Figure 4. Each graph considers four cases 
to isolate the two main potential performance sacrifices: 
the requirement of mutual advantage (that could make 
detours unavailable) and relay choice by positive embed- 
ding error (that might not find them despite being possi- 
ble). The solid line represents an unconstrained detour 
overlay. Considering mutual advantage eliminates over 
half of the potential destinations for many nodes. For 
some, mutual advantage eliminates all detours; trivially, 
these are the nodes that cannot provide service to oth- 
ers. Choosing among either set (constrained to mutual 
Figure 5: PL-Dest: When a detour exists, density plot
of detour path RTT versus the direct path RTT (top, the 
color of each point represents the number of TIVs with 
the corresponding direct and detour RTTs), and PDF of 
direct path RTTs (bottom). 


advantage or not) via embedding error between source 
and relay sacrifices very few detours (the corresponding 
lines are almost indistinguishable from each other in Figure 4).
Mutual advantage has little impact on the latency reduction
to the destinations that are still reachable: at most 12% of
the median latency reduction is lost due
to the requirement of mutual advantage.


4.4 Detours to Nearby Destinations 


The destinations in PL-Dest include both regionally and 
globally popular websites. We expect that a regional 
website serves its pages from within the region of inter- 
est, so the direct path latencies to the destination from 
PeerWise nodes in that region should be small. Since 
the PlanetLab nodes are globally diverse, some “detours” 
may be for destinations unpopular in that node’s region. 
For example, detours to popular websites in China may 
be less useful for nodes in Europe or North America. In 
Figure 5, we show that latency reduction is not limited to
distant destinations. Because our rule to define a “good” 
detour requires at least 10 ms of reduction, few very short 
paths are featured. However, mutually-advantageous de- 
tours are found for direct paths too short to cross the At- 
lantic or Pacific oceans (< 100 ms). 


4.5 Multiple-IP Websites


For faster content delivery, around 20% of the popular 
websites in the PL-Dest data set are served from geo- 
graphically distributed locations. User requests are trans- 
parently directed to the geographically (or administra- 
tively) nearest IP address. 

Using the PL-Dest data set, we compute how many 
nodes can find detours to each of the 448 websites and 
plot it against the total number of /24 prefixes of each 


website. Figure 6 presents the results. Each point in the 
plot is associated with one server name. Most websites 
with IP addresses in at least two prefixes can be reached 
faster from at least one PlanetLab node. We divide the 
plot into six regions and describe each in the accompa- 
nying table. 

Figure 6 shows that PeerWise has the potential to be 
effective in reducing latency to most popular websites, 
even when they employ other latency-reducing tech- 
niques such as mirroring or DNS redirection. 


4.6 Simulation Limitations 


First, our pairwise peerings are established expecting that 
each destination will be accessed as often as any other. 
Clearly, not all destinations are equally popular, but we 
cannot estimate how often peers will use the peering. 
Our evaluation might favor VoIP applications where the 
endpoints are well distributed and no endpoint is orders 
of magnitude more popular than the others. In Section 7, 
we experiment with different access patterns, including 
random and Zipf, to try to apply likely relative popularity
models to traffic. 

Second, the latencies between DNS servers or Planet- 
Lab nodes may underestimate the latencies between end- 
hosts in the Internet. Although the latency matrix be- 
tween DNS servers and PlanetLab hosts may represent 
the locations of hosts in the coordinate space, these data 
sets may not represent the latencies seen by such hosts. 

Third, using PlanetLab nodes to reach popular destina- 
tions may raise questions about the validity of our eval- 
uation. Connecting to a commercial site via a PlanetLab 
relay may reveal detours that would not be discovered 
had the relay been on the commercial network. How- 
ever, Abilene and NLR, research networks that are part 
of Internet2, use wavelengths on fiber leased from other 
providers along rights-of-way shared with commercial 
networks. We believe that this sharing prevents research 
networks from providing an unfair advantage in latency 
reduction. We have even observed detours between Pla- 
netLab nodes—routing within the academic network is 
not so latency-optimal as to prevent detours. 

Finally, we do not model the bandwidth of the con- 
nection. Even though mutual latency reductions lead to a 
pairwise peering, limited bandwidth may prevent it from 
helping. As described in Section 3, we expect to use 
PeerWise only with latency-sensitive applications that do 
not require high bandwidth. 


5 Design, Part I: Mechanisms 


We present next the design of the PeerWise routing over- 
lay network. In this section, we focus on the key fea- 
tures of PeerWise: detour detection using network co- 
ordinates for scalability, neighbor tracking for improv- 
ing efficiency, and pairwise negotiations for fairness. In 
Figure 6: Detours to mirrored websites: The figure presents the number of nodes that find detours (c) versus the number of
prefixes for each website (p). The table describes the six regions in the figure.


Section 6, we describe and evaluate the policies of each 
PeerWise node. We present the implementation and eval- 
uation details in Section 7. 


5.1 Virtual Network Coordinates 


Every PeerWise node must compute its own network co- 
ordinate before searching for detours. We use Vivaldi [8] 
for network coordinates. Every node maintains a set of 
neighbors that it probes periodically. It uses the round 
trip time and the network coordinate of these neighbors 
to update its own coordinate. After each probe, the node 
computes the coordinate that minimizes the squared es- 
timation error to all of its neighbors. To help the system 
converge quickly, nodes with uncertain coordinates can 
move farther with each measurement. Figure 7(a) shows 
the coordinate computation process. 
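
A single coordinate update can be sketched as below. The sketch follows the general shape of the published Vivaldi update [8], which approaches the minimum of the squared estimation error one sample at a time; it is not the Pyxida or PeerWise code. The constants CC and CE, the function name vivaldi_update, and the example values are assumptions.

```python
import math

CC = 0.25  # assumed timestep constant
CE = 0.25  # assumed error-smoothing constant

def vivaldi_update(coord, err, peer_coord, peer_err, rtt_ms):
    """Apply one Vivaldi-style update for a single RTT sample.

    coord, peer_coord: coordinate tuples of equal length.
    err, peer_err: local error estimates in (0, 1].
    rtt_ms: measured round trip time, must be positive.
    Returns the new (coord, err) for the local node.
    """
    diff = [a - b for a, b in zip(coord, peer_coord)]
    dist = math.sqrt(sum(d * d for d in diff))
    if dist == 0.0:
        # Coincident points: push apart along an arbitrary fixed direction.
        unit = [1.0] + [0.0] * (len(coord) - 1)
    else:
        unit = [d / dist for d in diff]
    # Weight the sample by how uncertain we are relative to the peer, so
    # nodes with uncertain coordinates move farther per measurement.
    weight = err / (err + peer_err)
    sample_err = abs(dist - rtt_ms) / rtt_ms
    new_err = sample_err * CE * weight + err * (1.0 - CE * weight)
    # Positive step moves away from the peer (estimate was too short).
    step = CC * weight * (rtt_ms - dist)
    new_coord = tuple(c + step * u for c, u in zip(coord, unit))
    return new_coord, new_err

# The estimated distance (10) is shorter than the measured RTT (30),
# so the node moves away from its neighbor.
print(vivaldi_update((0.0, 0.0), 1.0, (10.0, 0.0), 1.0, 30.0))
```

With equal uncertainty on both sides the node moves 2.5 units away from its neighbor; a node that is already confident (small err) would move much less for the same sample.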

A node in PeerWise must learn the coordinates of des- 
tinations to discover long or short sides of a TIV. How- 
ever, if a destination is not participating in the overlay, it 
will not provide its own network coordinate. We there- 
fore extend Vivaldi to allow a node to compute a virtual 
network coordinate for any non-participating host. We 
refer to non-participating Internet nodes as hosts and to 
PeerWise participants simply as nodes. 

To generate virtual network coordinates for non- 
participating hosts in Vivaldi, a participating node 
chooses to become temporarily responsible for that host. 
The node runs Vivaldi on behalf of the host with one 
minor adjustment. Since the host is not participating in 
the system, it cannot manage its own neighbor set or ac- 
tively gather the round trip times needed to compute the 
coordinate. Instead, the participating node uses its own 
neighbor set as the neighbor set for the host, and requests 
that those neighbors measure the latency to the host, as 
shown in Figure 7(b). Our extensions are similar to those 
recently described by Ledlie et al. [13]. 
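
The idea of running the coordinate algorithm on behalf of a host can be sketched as follows. This is a simplified illustration, not the extension implemented in PeerWise: it uses a fixed-step spring relaxation instead of Vivaldi's adaptive timestep, and the function name virtual_coordinate, the parameters rounds and step, and the example values are all assumptions.

```python
import math

def virtual_coordinate(neighbor_coords, rtts_to_host, rounds=200, step=0.05):
    """Estimate a coordinate for a non-participating host.

    neighbor_coords: coordinates of the responsible node's neighbors.
    rtts_to_host: RTT (ms) each neighbor measured to the host, same order.
    The host cannot probe, so only the virtual point moves.
    """
    dims = len(neighbor_coords[0])
    # Start at the centroid of the neighbors (an arbitrary choice).
    host = [sum(c[i] for c in neighbor_coords) / len(neighbor_coords)
            for i in range(dims)]
    for _ in range(rounds):
        for coord, rtt in zip(neighbor_coords, rtts_to_host):
            diff = [h - c for h, c in zip(host, coord)]
            dist = math.sqrt(sum(d * d for d in diff))
            if dist == 0.0:
                unit = [1.0] + [0.0] * (dims - 1)
            else:
                unit = [d / dist for d in diff]
            # Spring force: move away if too close, closer if too far.
            host = [h + step * (rtt - dist) * u for h, u in zip(host, unit)]
    return tuple(host)

# Three neighbors report RTTs consistent with a host near (60, 80).
neighbors = [(0.0, 0.0), (100.0, 0.0), (0.0, 100.0)]
measured = [100.0, 89.4427191, 63.2455532]
print(virtual_coordinate(neighbors, measured))
```

Because the measurements in this example embed perfectly in two dimensions, the virtual coordinate settles close to (60, 80); real latencies would leave residual error.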

Requiring all nodes to compute virtual coordinates for 
all non-participating destinations would limit the scal- 
ability of PeerWise. We include a gossip mechanism 
to disseminate the calculated coordinates throughout the 
Figure 7: (a) Computing network coordinates for a PeerWise
node: A measures RTT to its neighbors and asks
for their coordinates (1); after it receives the replies (2) it 
computes the coordinate that minimizes the squared esti- 
mation error (3); (b) Computing network coordinates for 
a non-PeerWise node D: A asks each of its neighbors (4) 
to measure RTTs to D (5,6); after it receives the replies 
from the neighbors (7), A runs the network coordinate 
algorithm on behalf of D (8). 


system. At fixed intervals (10s in our experiments), each 
node picks one of its neighbors at random, then selects 
a random destination and sends to the neighbor the IP 
address, name and virtual coordinate of the destination. 

A node decides to take responsibility for a destination 
to which it wants to find a detour when the destination’s 
coordinate does not yet exist, becomes too old (1 day in 
our experiments), or becomes unstable (where stability 
depends on the embedding error to other nodes). Any 
node can generate coordinates independently; this de- 
centralization may allow simultaneous, redundant work. 
Rather than try to enforce a single consistent view of the 
coordinate, we allow any of these coordinates to be con- 
sidered valid estimates. When a node receives a new vir- 
tual coordinate through the gossip protocol, it uses that 
new coordinate only if it is more stable and it was up- 
dated by the node responsible for it. 

Virtual network coordinates are useful if a host is pop- 
ular. If the host is not popular, a node trying to discover 
a detour to that host will need to compute its virtual co- 
ordinate. Since this requires that the node’s neighbors 
measure the round trip time to the host, the node would
know all three sides of the triangle, so it would trivially 
discover TIVs. However, if the node knows the virtual 
coordinate of a host already (because the host is pop- 
ular and its coordinate has been gossiped), it will only 
know the two adjacent sides of the triangle, and it will be 
able to make predictions about the third side between the 
neighbor and the destination. We evaluate these predic- 
tions in Section 6.3. 


5.2 Neighbor Tracking 


The success of our protocol depends on the ability of 
nodes to find other nodes to establish pairwise peerings. 
There are many possible relays for a node, any of which 
may have high embedding error with respect to the node. 
Recall that high embedding error for a pair of nodes indi- 
cates a higher probability that the pair is part of a detour. 
We use neighbor tracking to find the nodes that are more 
likely to offer detours. With neighbor tracking, a Peer- 
Wise node remembers extra neighbors and learns about 
good potential relays from its neighbors or from nearby 
(in latency) nodes. The neighbors in this section are not 
relays; they are only candidates for becoming so. 

When joining PeerWise, a node bootstraps its potential 
neighbor set from a known PeerWise node and uses it 
to compute its network coordinate. Once the network 
coordinate is stable, the node asks its neighbors about 
their own neighbors, remembering those nodes with high 
embedding error. For example, in Figure 8, A asks for 
the neighbor set of B, formed of B1, B2 and B3. Node A
then computes the embedding error from itself to each of
B1, B2 and B3 and adds those nodes to which the error
is most positive to its neighbor list. These nodes are the 
most likely to form a short side of a TIV with A. 
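
The neighbor-tracking step can be sketched in Python. The sign convention is an assumption inferred from the text: embedding error is taken as estimated coordinate distance minus measured RTT, so that a positive value marks a likely short side of a TIV. The names embedding_error and track_neighbors, the keep parameter, and the example coordinates are hypothetical.

```python
import math

def distance(c1, c2):
    """Euclidean distance between two coordinates."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(c1, c2)))

def embedding_error(coord_a, coord_b, measured_rtt_ms):
    """Estimated distance minus measured RTT (sign convention assumed).

    Positive values mean the coordinate space stretches the edge, which
    marks it as a likely short side of a TIV.
    """
    return distance(coord_a, coord_b) - measured_rtt_ms

def track_neighbors(my_coord, candidates, keep):
    """Keep the candidates with the most positive embedding error.

    candidates: dict name -> (coord, measured_rtt_ms).
    """
    scored = [(embedding_error(my_coord, coord, rtt), name)
              for name, (coord, rtt) in candidates.items()]
    positive = [item for item in scored if item[0] > 0.0]
    positive.sort(reverse=True)
    return [name for _, name in positive[:keep]]

# A at the origin learns about B's neighbors B1, B2 and B3.
learned = {"B1": ((30.0, 40.0), 20.0),   # error +30
           "B2": ((60.0, 80.0), 110.0),  # error -10
           "B3": ((0.0, 50.0), 45.0)}    # error +5
print(track_neighbors((0.0, 0.0), learned, keep=2))
```

Here A would add B1 and B3 to its neighbor list and ignore B2, whose measured latency exceeds its estimated distance.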

For scalability, we limit the number of neighbors of 
each node. Neighbors with higher potential to offer the 
best detours replace less-efficient neighbors. We con- 
sider and evaluate different methods for ranking poten- 
tial neighbors in Section 6.1. Because PeerWise allows 
a node to exchange information about neighbors with 
neighbors, we expect each node to have ample choices. 


5.3 Pairwise Negotiation


PeerWise nodes negotiate with their neighbors to request 
or advertise alternate routes. As discussed in Section 3, 
a detour to a destination is likely to exist if the estimated 
distance to the destination is much smaller than the mea- 
sured latency. In this case, a node asks its neighbors with 
high embedding errors whether they can offer a faster 
path (Figure 9(c)). Nodes are not limited to this simple 
strategy. In Section 6.2, we evaluate different policies for 
choosing relays and deciding whether to request detours 
for a destination. 

Actively requesting detours may be inefficient, especially
if the connection to the destination is short-lived.







Figure 8: Neighbor Tracking. (a) A chooses the neigh- 
bor to which it has the highest embedding error and re- 
quests its neighbor set; (b) A measures RTTs to each of 
the nodes received from B; (c) A adds to its neighbor set 
those nodes to which it has a positive embedding error. 


In addition, the time to find a detour may dominate the 
latency reduction achieved. To encourage fast detour dis- 
covery, PeerWise nodes also proactively advertise paths 
to popular destinations. For example, in Figure 9(d), 
node A observes that the link to node D, which may or 
may not be running PeerWise, has a high estimation error.
This means that AD may be a short side in a TIV. A
advertises D on all other potential short sides (i.e., to all 
neighbors to which it has a high estimation error). 
Finding detours is not enough: PeerWise is based on 
mutual agreements between nodes. A sender node can 
use a detour only if the relay that offers it also finds value 
in the sender. When requesting a detour from a neighbor, 
a PeerWise node includes a list of possible destinations to 
which it has high embedding error. The path to these des- 
tinations is more likely to be part of a detour for another 
node, as described in Section 3. Requests for detours are 
accepted only when both the sender and the receiver find 
mutual advantage in forwarding each other’s traffic. 


5.4 


Each PeerWise node maintains two tables: a peering ta- 
ble and a negotiation table. The peering table tracks es- 
tablished, mutually advantageous peering relationships. 
The negotiation table is an antechamber for the peering 
table and tracks the nodes with which no peering has 
been established, but which are candidates for mutually 
beneficial peerings. Once a peering is established, the 
peer moves from the negotiation table to the peering table.
An entry in either table is associated with a node
in the system and contains that node’s IP address, network
coordinate, and a history of round trip times to it. The peering
table adds the SLA and the utilization of the peering. 
The SLA specifies the benefit that each node is ex- 
pected to receive and offer through the peering. We allow 
different measures for the mutual benefit of a connection 
as long as the peering nodes both agree upon them. Two 
nodes can form a peering and agree that each of them 
Figure 9: Detour Requests and Advertisements. (a) A wants to connect to destination D; (b) A discovers the network
coordinate of D using Vivaldi or Virtual Vivaldi; (c) A requests a detour to D from the neighbor to which it has the 
highest embedding error; (d) A advertises its path to D to all neighbors that have positive embedding error to A. 


uses the other for the same number of detours. Alter- 
natively, they may decide that their benefit is measured 
in the average latency reduction obtained through each 
other. For example, in Figure 1, nodes A and B may es- 
tablish an SLA that promises an average latency reduc- 
tion of 30 ms from A to D and from B to C. In addition, 
two peers may establish an imbalanced peering, in which 
one peer benefits more than the other, if both consider the 
agreement to be fair. 
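
The two tables and the SLA can be pictured with a small data-structure sketch. It is illustrative only: the class and field names (Entry, PeerTables, sla_latency_reduction_ms, detours_used) are hypothetical, and the SLA is reduced to a single promised average latency reduction, which is just one of the measures the text allows.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

@dataclass
class Entry:
    """One row of either table: address, coordinate, and RTT history."""
    ip: str
    coord: Tuple[float, ...]
    rtt_history_ms: List[float] = field(default_factory=list)
    # The next two fields are meaningful only in the peering table.
    sla_latency_reduction_ms: Optional[float] = None
    detours_used: int = 0

class PeerTables:
    def __init__(self):
        self.negotiation: Dict[str, Entry] = {}  # candidates
        self.peering: Dict[str, Entry] = {}      # established peerings

    def add_candidate(self, name, ip, coord):
        """Record a node that might become a mutually beneficial peer."""
        self.negotiation[name] = Entry(ip=ip, coord=coord)

    def establish(self, name, sla_latency_reduction_ms):
        """Move a candidate to the peering table and attach the SLA."""
        entry = self.negotiation.pop(name)
        entry.sla_latency_reduction_ms = sla_latency_reduction_ms
        self.peering[name] = entry
        return entry

tables = PeerTables()
tables.add_candidate("B", "10.0.0.2", (12.0, 7.5))
tables.establish("B", sla_latency_reduction_ms=30.0)
print(sorted(tables.peering), sorted(tables.negotiation))
```

The 30 ms value mirrors the example SLA between A and B in the text; a peering measured by number of detours would track detours_used on both sides instead.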

Peerings may become imbalanced in time. This hap- 
pens because latencies change due to failures or conges- 
tion, because peers do not respect the agreement, or be- 
cause they have different connection rates to their desti- 
nations. PeerWise nodes renegotiate existing peerings to 
account for latency changes and to find the best detours 
available, as we describe in Section 7. However, we do 
not monitor the byte-level usage of a peering. Our fo- 
cus is on finding and taking advantage of mutual latency 
reductions. In a previous paper [14], we describe a moni- 
toring and accounting mechanism that ensures long-lived 
and mutually advantageous peerings, even when nodes 
are selfish or traffic demands differ. 


6 Design, Part II: Policies 


PeerWise is designed to be a scalable overlay for find- 
ing low-latency detours. For scalability, each node 
must choose which neighbors to maintain peerings with, 
choose among neighbors to find a relay, and predict 
whether to seek a relay for a destination. 


PeerWise nodes must learn. Nodes compute coordi- 
nates for new destinations to help other nodes predict 
detours. Newly used relay paths can be instrumented 
so that they can be dropped if the prediction of their 
utility was incorrect or preserved if their utility is clear. 
Finally, nodes must remember recent destinations so
that a neighbor set can be customized to the likely traf- 
fic stream. Learned behavior will depend on practical 
deployment: for example, how frequently nodes return 
to the same latency-sensitive destination. In fact, as a 
destination is contacted again and again, PeerWise might 
lower its standards for a “good” detour to provide improved
application performance, or try reaching the destination
via relays that are not obvious candidates. In
this section, we make no assumptions about the utility of 
learned information, and instead focus on establishing a 
broad base of PeerWise connections for reaching all des- 
tinations. 


To study neighbor and relay selection algorithms, we 
collected latency measurements and coordinates for 262 
PlanetLab nodes and the 448 popular web servers. We 
considered only the PlanetLab nodes responsive at the 
time of the measurement. To gather this PL-Dest-Pyxida 
data set, we used Pyxida [24], an implementation of the 
Vivaldi coordinate system. To compute coordinates for 
the web servers, we extended Pyxida with our virtual co- 
ordinate algorithm. Every 30 seconds, for 18 hours on 
January 14, 2008, we took a snapshot containing RTT 
measurements and coordinates (virtual and non-virtual). 
We use only a subset of this data: median latency over 
the past 10 measurements, and network coordinates, all 
observed after Pyxida ran for two hours (to converge). 


6.1 Choosing Neighbors 


Each PeerWise node must be able to decide whether a 
new node would offer better detours than existing neigh- 
bors. A new neighbor may provide relays toward a region 
of coordinate space or directly to known destinations. 
Deciding upon future mutual advantage is a prediction of 
future accesses and future performance. In this section, 
we evaluate the ability of a PeerWise node to predict, 
from coordinates and measurement, whether a neighbor 
will contribute. 


If nodes were to contact only a few, known destina- 
tions, choosing neighbors would be simple: replace a 
neighbor if the new one provides a better path to an in- 
teresting destination. However, we do not expect access 
patterns to be nearly so predictable. Instead, we wish 
to determine, when a new neighbor arrives, whether it is 
likely to provide a shortcut to a useful region in coordi- 
nate space. 
Figure 10: (left) Neighbor selection algorithms. As the number of legitimate neighbors is restricted, coverage, proximity
and embedding error (for 32 or more neighbors) algorithms preserve the most detours. (right) Relay selection
algorithms. Best detours are found through relays selected using the direct and conservative algorithms. 


We consider a few traffic-independent neighbor selec- 
tion policies, expecting that a combination of schemes 
would perform best. We separate them into two classes: 
value schemes are likely to provide the best detours, but 
may overlap; diversity schemes prefer relays that are dif- 
ferent from those already chosen. 

Value schemes include embedding error and proxim- 
ity. Embedding error prefers neighbors with the largest 
positive error in the embedding of the source to poten- 
tial neighbor edge: these nodes are likely to traverse the 
most coordinate distance with the lowest latency. Prox- 
imity prefers neighbors with the smallest absolute latency 
between the source and a potential neighbor. 

By choosing the best neighbors exclusively, a node 
may miss diversity. Coverage uses the relay’s coordinate 
and latency to determine the region in coordinate space 
that the relay covers. We split the space with a 2^d-tree
structure (for scalability) and prefer neighbors that mini- 
mize the expected detour latency to every point in space. 
Angle prefers neighbors in different directions in the co- 
ordinate space. For all pairs of potential neighbors, a 
node computes the angle between the line segments from 
itself to the neighbors, and selects the neighbors with the 
largest angles. Random chooses neighbors at random to 
provide a point of comparison. 
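
Three of the five policies are simple enough to sketch directly. The Python below covers proximity, embedding error, and random; coverage and angle are omitted because the text does not give enough detail to reproduce them faithfully. The function name select_neighbors, the policy strings, and the candidate data are assumptions, and embedding error is again taken as estimated distance minus measured latency.

```python
import math
import random

def _dist(c1, c2):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(c1, c2)))

def select_neighbors(policy, my_coord, candidates, k, rng=None):
    """Pick k neighbors. candidates: dict name -> (coord, rtt_ms)."""
    names = sorted(candidates)
    if policy == "proximity":
        # Smallest measured latency first.
        names.sort(key=lambda n: candidates[n][1])
    elif policy == "embedding_error":
        # Largest (estimated distance - measured latency) first, which is
        # the same as the smallest (latency - distance) first.
        names.sort(key=lambda n: candidates[n][1]
                   - _dist(my_coord, candidates[n][0]))
    elif policy == "random":
        (rng or random.Random()).shuffle(names)
    else:
        raise ValueError("unknown policy: " + policy)
    return names[:k]

candidates = {"N1": ((30.0, 40.0), 20.0),
              "N2": ((60.0, 80.0), 110.0),
              "N3": ((0.0, 50.0), 45.0),
              "N4": ((6.0, 8.0), 12.0)}
me = (0.0, 0.0)
print(select_neighbors("proximity", me, candidates, 2))
print(select_neighbors("embedding_error", me, candidates, 2))
```

The two value schemes disagree in this example: proximity favors the nearby node N4, while embedding error favors N3, whose edge covers more coordinate distance than its latency suggests.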

In Figure 10(left), we compare these neighbor selec- 
tion algorithms. We vary how many neighbors a node 
can have from 1 to 200. At each step, we add a new
neighbor based on one of the five schemes. Proximity 
and coverage perform the best, but embedding error also 
performs well with 32 or more neighbors. We choose 
proximity as our primary neighbor selection metric be- 
cause it performs similarly to coverage and is easier to 
use. 


6.2 Choosing Relays 


Neighbor selection determines the set of neighbors that 
may provide a detour path. With relay selection, a node 


attempts to discover quickly the neighbor that offers 
the best detour to a specific destination. Like server- 
selection problems solved by network coordinates, re- 
lay selection seeks the shortest combination of the di- 
rect path to the relay and the predicted path between re- 
lay and destination. Over time, this performance can be 
measured, but to minimize latency, detour performance 
should be predicted. At the very least, we hope to reduce 
the number of relays that we need to simultaneously con- 
tact to find a good detour when contacting a destination 
for the first time. 

We consider the following policies for choosing re- 
lays for a destination. Direct prediction adds the mea- 
sured source-to-relay latency to the estimated relay-to- 
destination distance in coordinate space, then chooses 
the relay with the lowest sum. Because latency measure- 
ments may be more reliable than coordinates, we evalu- 
ated a conservative prediction, which adds the source-to- 
relay latency measurement again to increase its influence 
in the prediction. This is based on the expectation that 
coordinates are inaccurate and seeks greater likelihood 
of a good detour in preference to the best detour at the 
top of the list. A high-risk scheme chooses the neigh- 
bor with the highest embedding error. Finally, random 
provides a baseline. 
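
The direct and conservative predictions differ only in how heavily they weight the measured source-to-relay latency, as the following sketch shows. The names predicted_latency and choose_relay and the example coordinates are hypothetical; the high-risk and random schemes are left out for brevity.

```python
import math

def _dist(c1, c2):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(c1, c2)))

def predicted_latency(policy, src_to_relay_ms, relay_coord, dest_coord):
    """Score a relay; lower is better."""
    estimated = _dist(relay_coord, dest_coord)  # relay-to-destination
    if policy == "direct":
        return src_to_relay_ms + estimated
    if policy == "conservative":
        # Count the measured leg twice to favor trusted measurements.
        return 2.0 * src_to_relay_ms + estimated
    raise ValueError("unknown policy: " + policy)

def choose_relay(policy, neighbors, dest_coord):
    """neighbors: dict name -> (coord, measured_rtt_ms)."""
    return min(sorted(neighbors),
               key=lambda n: predicted_latency(
                   policy, neighbors[n][1], neighbors[n][0], dest_coord))

dest = (100.0, 0.0)
neighbors = {"R1": ((50.0, 0.0), 50.0),   # direct 100, conservative 150
             "R2": ((80.0, 0.0), 70.0)}   # direct 90,  conservative 160
print(choose_relay("direct", neighbors, dest))
print(choose_relay("conservative", neighbors, dest))
```

In this example the direct prediction picks R2, which looks best on paper, while the conservative prediction picks R1 because less of its predicted path depends on coordinates.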

We select 32 neighbors for each node using the 
proximity-based algorithm and evaluate the four relay- 
selection algorithms. In Figure 10(right), we show the 
quality of predictions made using these algorithms in 
terms of relative performance lost compared to the best 
choice. The conservative approach performs best: ap- 
proximately 80% of the detours chosen are only 20% 
longer than the best detour between the same pair of 
nodes. 


6.3 Deciding Whether to Relay


Deciding whether to use a detour depends on a predic- 
tion of whether it will improve application performance. 
This has two components: whether the traffic is sensitive
to latency and whether a known neighbor is likely to
provide a detour path. We evaluate the latter. Whether 
traffic is latency sensitive can be crudely inferred by 
ports, by commercial packet scheduling products, or by 
application-based proxies that can differentiate classes of 
traffic. In this section, we assume that the traffic is la- 
tency sensitive and attempt to predict whether to relay. 
The decision of whether to relay depends first on 
whether virtual coordinates for the destination are available and
recent. If there are no coordinates available for the des- 
tination, a node may choose to seek a relay by probing. 
If there are coordinates for the new destination, it may 
speculatively use a predicted relay, collect more infor- 
mation, or go directly to the destination without probing. 


6.3.1 If the destination has no coordinates 


If the destination lacks coordinates, the node should for- 
ward the packet directly, and if the destination is some- 
what distant, i.e., latency is long enough that a good de- 
tour is possible, the node may trigger latency probing 
from neighbors. The latency measurements by neighbors 
will, first, allow coordinates to be estimated and, second, 
provide direct latency measurements of the potential de- 
tour paths. Conveniently, if a detour path is available, 
the node may learn about it before the end of the second 
round trip (by starting the latency probing as soon as 10 
ms have elapsed in the first contact). 

The distance to the destination may be an indicator 
of whether the destination has a detour. In Figure 11, 
we show how often a destination has a relay within the 
neighbor set, given that the latency to the destination is 
above some value. For 95% of the edges, as the latency 
increases, so does the probability of a detour for the edge. 
The plot suggests that, after sending a probe to the desti- 
nation, the longer a node waits to receive a response, the 
more likely it is that a detour exists for that destination. 
For 15% of destinations (between 236 ms and 1054 ms 
of latency), there is more than a 50% chance that a detour 
exists. We expect that actual node behavior, in terms of 
when to seek out a detour, will be application dependent. 
Figure 11: As the latency to a destination increases, so
does the probability that there is a detour.
probing | probing | probing | probing
55.8% 57.3% 20.3% 18.8%
316%


Table 1: Using coordinates alone or coordinates with 
a latency probe to the destination, nodes can predict 
whether to use PeerWise. Probing the destination slightly 
increases the probability of making a correct decision. 





For instance, a node may always try to find a detour for 
frequently contacted destinations. 

6.3.2 If the destination has coordinates 

If the destination has known coordinates that have been 
gossiped, a node can decide before sending the first 
packet: is there likely to be a detour among its neigh- 
bors? Assuming that all coordinates are accurate, except 
for the measured latencies to neighbors, the node can find 
a shortcut without direct contact to the destination. 

For certain uses of PeerWise, getting the relay right 
before contacting a destination is useful. If the desti- 
nation will be reached with a TCP connection, the first 
choice can stick: the source address on the SYN packet 
is fixed, and the connection cannot be easily migrated 
to a relay. For interactive applications over long TCP 
connections—shell, game, chat, perhaps voice—this de- 
cision may be important. 

We show that, most of the time, when the coordinates 
of the destination are known, a node makes the correct 
decision on whether to use PeerWise. We define a correct 
decision as finding a good relay (within 25% of the best 
latency reduction) when a detour exists, or not attempt- 
ing to find one when a detour does not exist. All other 
decisions of a node (i.e., attempting to find a relay when 
a detour does not exist or finding a bad relay) are con- 
sidered incorrect. We summarize all possible situations 
in Table 1. We used the proximity policy for neighbor 
selection and the conservative policy for relay selection. 
Using coordinates alone, nodes make a correct decision 
63.1% of the time. The prediction accuracy improves to 
68.4% if the latency to the destination is known. We con- 
sider the frequency of correct and incorrect decisions to 
be acceptable; a more ambitious node might try to dis- 
cover detours more often at the expense of making more 
mistakes. 
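
The definition of a correct decision can be restated as a small classifier. This is a sketch under one stated assumption: "within 25% of the best latency reduction" is read as achieving at least 75% of the best available reduction. The names classify_decision and accuracy and the sample outcomes are hypothetical and are not the data behind Table 1.

```python
def classify_decision(attempted, chosen_reduction_ms, best_reduction_ms):
    """Return "correct" or "incorrect" for one source-destination pair.

    attempted: whether the node tried to use a relay.
    chosen_reduction_ms: reduction via the chosen relay (ignored if the
        node did not attempt a detour).
    best_reduction_ms: reduction of the best good detour, or None if no
        detour exists.
    """
    detour_exists = best_reduction_ms is not None
    if not detour_exists:
        # Right call is to go direct.
        return "incorrect" if attempted else "correct"
    if not attempted:
        return "incorrect"  # missed an available detour
    good = chosen_reduction_ms >= 0.75 * best_reduction_ms
    return "correct" if good else "incorrect"

def accuracy(outcomes):
    """Fraction of (attempted, chosen, best) outcomes that are correct."""
    labels = [classify_decision(*o) for o in outcomes]
    return labels.count("correct") / len(labels)

outcomes = [(False, 0.0, None),   # no detour, none sought: correct
            (True, 0.0, None),    # no detour, sought anyway: incorrect
            (True, 40.0, 50.0),   # good relay found: correct
            (True, 30.0, 50.0)]   # poor relay found: incorrect
print(accuracy(outcomes))
```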


7 Implementation and Evaluation 

We implement PeerWise and run it under real network 
conditions on PlanetLab. In this section, we briefly de- 
scribe our implementation, then show that this imple- 
mentation can quickly find mutually advantageous de- 
Figure 12: Fraction of the popular destinations reachable
through mutually advantageous detours from PlanetLab. 


tours that offer significant and continuous latency reduc- 
tion. We then confirm that PeerWise detours can speed 
short web transfers in practice. 


7.1 Implementation 


We divide the functionality of PeerWise into two parts: 
the network coordinate system and a stand-alone dae- 
mon that includes all other components described in Sec- 
tion 5. We use Pyxida [24] for computing coordinates, 
since it is the only network coordinate system imple- 
mentation we are aware of that is tested extensively un- 
der realistic network conditions [12]. Pyxida is written 
in Java and uses the Vivaldi algorithm [8] to compute 
coordinates for nodes. Each Pyxida node maintains a 
variable number of neighbors, updated constantly, and 
probes them at regular intervals. We augmented Pyxida 
to compute virtual coordinates for hosts that do not 
participate, as described in Section 5.1. 

We wrote the PeerWise daemon in approximately 
3,000 lines of Ruby. The daemon listens for connections 
from other PeerWise nodes, and negotiates, establishes, 
and maintains mutually advantageous peerings. It com- 
municates with Pyxida regularly, using RPC over TCP, 
to update the measured latencies and coordinates of the 
current set of neighbors as well as of the destinations that 
are currently served. By relying on the latency measure- 
ment and coordinate computation performed by Pyxida, 
we minimize the communication overhead. On average, 
every node consumes less than 1 KB/s of bandwidth 
(including Pyxida traffic). 


7.2 Finding Detours 


We ran PeerWise on 189 PlanetLab nodes, chosen for 
their stability, in September 2008. We focus on what 
detours PeerWise can find, where a detour is determined 
by pings, not by actual transfers. We express mutual 
advantage between two nodes as the number of detours 
that each offers the other. We experimented with three 
scenarios: 




• All-dest: Each node tries to find detours to all 500 
popular websites (described in Section 4) to which 
it can measure an RTT. 

• Rand-dest: Each node tries to find detours to a 
random subset of the 500 websites. 

• Zipf-dest: The popularity of destinations follows a 
Zipf distribution. 
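
The Rand-dest and Zipf-dest destination sets could be drawn as in the sketch below. The subset size k and the Zipf exponent s are not specified in the text and are assumptions here, as are all function names; we also assume that the destination list is ordered by popularity rank.

```python
import random


def zipf_weights(n, s=1.0):
    """Unnormalized Zipf weights for ranks 1..n (exponent s is an assumption)."""
    return [1.0 / rank ** s for rank in range(1, n + 1)]


def pick_rand_dest(destinations, k, rng):
    """Rand-dest: a uniformly random subset of k destinations."""
    return rng.sample(destinations, k)


def pick_zipf_dest(destinations, k, rng, s=1.0):
    """Zipf-dest: draw by Zipf popularity until k distinct sites are chosen."""
    if k > len(destinations):
        raise ValueError("k exceeds the number of destinations")
    weights = zipf_weights(len(destinations), s)
    chosen, seen = [], set()
    while len(chosen) < k:
        site = rng.choices(destinations, weights=weights, k=1)[0]
        if site not in seen:
            seen.add(site)
            chosen.append(site)
    return chosen
```

With a Zipf draw, highly ranked sites are chosen far more often than sites in the tail, so different nodes tend to share many of their destinations.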

Our discussion focuses on the All-dest experiment, but 
we summarize the results from Rand-dest and Zipf-dest 
in Table 2. Recall that the destinations are already very 
popular servers, many of which use content distribution. 
Therefore, All-dest is not a best-case scenario. 

We describe the behavior of each node next. Nodes 
start looking for detours, after their network coordinates 
have stabilized, by successively sending detour requests 
to their neighbors. We limit the number of neighbors of 
each node to 32 for scalability and use the proximity pol- 
icy for selecting neighbors. We make sure that no two 
detour requests are simultaneous: a new request is sent 
only when a reply (either positive or negative) has ar- 
rived for a previous one or a timeout has occurred. Each 
request tries to find detours to as many destinations as 
possible. Requests are sent continuously, even to the 
nodes with which peerings have been established or to 
the nodes that, in the past, could not offer detours. In this 
way, we are constantly renegotiating the peerings and are 
always ready to adapt to changes in latency. 

PeerWise relies on the latency measurements and co- 
ordinate computations performed by Pyxida. We update 
both every 10 minutes. To avoid instability due to vary- 
ing latencies, the updated values for latencies represent 
moving medians across the last 10 samples collected. 
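
A moving median over the last 10 samples can be kept with a bounded queue. The class below is a minimal sketch with names of our choosing; it is not the code used in the PeerWise daemon.

```python
from collections import deque
from statistics import median


class MovingMedian:
    """Median of the most recent `window` latency samples."""

    def __init__(self, window=10):
        # A deque with maxlen drops the oldest sample automatically.
        self.samples = deque(maxlen=window)

    def add(self, rtt_ms):
        """Record a new sample and return the current moving median."""
        self.samples.append(rtt_ms)
        return median(self.samples)
```

A single outlier, such as one 100 ms sample among 10 ms samples, moves the median very little, which is why a median is preferable to a mean here.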

We present results for the first 36 hours of the experi- 
ment, counting from the time when nodes start request- 
ing detours. For ease of exposition and to study startup 
behavior, all nodes start requesting detours simultane- 
ously. We show that most nodes find mutually advan- 
tageous detours and that these detours lead to significant 
and stable latency reductions. 

7.2.1 PeerWise finds detours 

For each node, we count the destinations that can be 
reached using a mutually advantageous detour for the du- 
ration of the experiment. Figure 12 shows the distribu- 
tion of the fraction of reachable destinations. Focus only 
on the line labeled “max” for now. Each point corre- 
sponds to a node, and its projection on the horizontal axis 
represents the fraction of destinations for which the node 
finds detours. Around 25% of the nodes cannot find any 
detours, while most nodes find detours to at least 10% of 
the popular destinations. Our results are consistent with 
those of the evaluation in Section 4 (see Figure 4). For 
Rand-dest and Zipf-dest, fewer nodes (around 50%) are 
able to find detours at all. This is because the number of 
destinations is much smaller than in All-dest. 
Table 2: Characteristics of PeerWise detours: latency reduction, longevity and variability. 


7.2.2 PeerWise finds detours quickly 

How quickly are the detours discovered? We compute 
the fraction of destinations to which a detour is 
discovered by PeerWise within the first 10 minutes, 1 hour, and 
5 hours. Figure 12 shows the results as cumulative 
distributions. Many detours are discovered within the first 
10 minutes of the experiment and the majority after less 
than an hour. Fewer and fewer detours are discovered 
afterward. These are mostly the detours that appear due 
to varying latencies—they are discovered because Peer- 
Wise constantly adapts to new latencies and coordinates. 
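
The per-node fractions plotted in Figure 12 can be derived from the time at which each detour was first discovered. The sketch below assumes a hypothetical list of discovery times, in seconds since the node started requesting detours, with one entry per destination for which a detour was found.

```python
def fraction_discovered(discovery_times_s, total_destinations, horizon_s):
    """Fraction of all destinations with a detour found within horizon_s."""
    found = sum(1 for t in discovery_times_s if t <= horizon_s)
    return found / total_destinations


# Horizons used in the text: 10 minutes, 1 hour, and 5 hours.
HORIZONS_S = {"10min": 600, "1h": 3600, "5h": 18000}
```

Evaluating the function at each horizon for every node, then sorting the per-node values, gives one cumulative distribution per horizon.
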

7.2.3 PeerWise offers significant latency reduction 

The detours discovered by PeerWise would not be very 
useful if they offered minimal latency reductions com- 
pared to the direct paths. We show that this is not the 
case. Recall that we have set a threshold: we consider 
only those detours that offer reduction of more than 10 
ms and 10% of the direct-path latency. Here we focus on 
the latency reductions negotiated by PeerWise. In Sec- 
tion 7.3, we show how these reductions hold when user 
traffic traverses the detour path. 
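
The threshold can be expressed directly in code. In this sketch, detour_ms stands for the latency of the path through the relay; the function names are ours.

```python
def latency_reduction(direct_ms, detour_ms):
    """Return (absolute reduction in ms, reduction relative to the direct path)."""
    absolute = direct_ms - detour_ms
    return absolute, absolute / direct_ms


def is_useful_detour(direct_ms, detour_ms, min_abs_ms=10.0, min_rel=0.10):
    """A detour counts only if it saves more than 10 ms AND more than 10%."""
    absolute, relative = latency_reduction(direct_ms, detour_ms)
    return absolute > min_abs_ms and relative > min_rel
```

Requiring both conditions filters out detours that save many milliseconds on a very slow path but little in relative terms, as well as detours that save a large fraction of an already short path.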

We compute all latency reductions for each (source, 
destination) pair for which a detour exists, both as ab- 
solute (milliseconds) and relative (fraction of the direct 
path latency) values. We show the median, 10th and 90th 
percentiles in Table 2. The median latency reduction is 
29 ms, or 26% of the latency of the direct path. Ten percent 
of the pairs have a reduction of more than 131 ms. These 
large reductions are caused by unusually high direct-path 
latencies, possibly due to traffic shaping. By circumventing these slow 
links, PeerWise can offer significant latency reduction. 
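
Summary statistics of this kind can be computed as sketched below. The text does not state which percentile definition was used; the nearest-rank method here is an assumption, and the function names are ours.

```python
from statistics import median


def nearest_rank_percentile(values, p):
    """p-th percentile (integer p in 1..100) using the nearest-rank method."""
    ordered = sorted(values)
    # Integer form of ceil(p * n / 100), which avoids floating-point rounding.
    rank = max(1, (p * len(ordered) + 99) // 100)
    return ordered[rank - 1]


def summarize(reductions):
    """10th percentile, median, and 90th percentile of a list of reductions."""
    return {
        "p10": nearest_rank_percentile(reductions, 10),
        "median": median(reductions),
        "p90": nearest_rank_percentile(reductions, 90),
    }
```

The same function applies to absolute reductions in milliseconds and to relative reductions expressed as fractions of the direct-path latency.
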

7.2.4 Longevity and variability 

PeerWise nodes may offer continuous latency reduction 
to a destination using several peerings. For each (source, 
destination) pair, we evaluate how long PeerWise offers 
reduction and with how many different relays. Ideally, 
every destination will be reached continuously through 
the same peering. Long-lived reductions through the 
same peering offer nodes more choices in when to use 
the mutually advantageous connection. 

We consider two metrics: longevity and variability. 
Longevity captures how PeerWise nodes maintain la- 
tency reduction once a detour is discovered. We define 
the longevity of a destination D from a node S as the 
fraction of time that PeerWise offers S a detour to D, af- 
ter PeerWise first learns about a shorter path from S to 
D. A longevity of 1 for the pair (S, D) means that, after 
PeerWise discovers the first detour between S and D, 
it will always offer some detour between S and D. Vari- 
ability represents the number of different relays that S 
uses to obtain continuous reduction to D. The lower the 
variability, the easier it is to maintain latency reduction. 
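
Both metrics can be computed from a per-pair timeline. The sketch below assumes a hypothetical log with one entry per equally spaced observation, holding the relay offered for (S, D) at that time or None when no detour is offered; with equal spacing, the fraction of time equals the fraction of observations.

```python
def longevity_and_variability(observations):
    """Return (longevity, variability) for one (source, destination) pair.

    observations -- relay identifier offered at each time step, or None.
    Longevity is None if no detour is ever discovered.
    """
    first = next((i for i, relay in enumerate(observations)
                  if relay is not None), None)
    if first is None:
        return None, 0
    # Only the period after the first discovery counts toward longevity.
    after = observations[first:]
    longevity = sum(relay is not None for relay in after) / len(after)
    variability = len({relay for relay in after if relay is not None})
    return longevity, variability
```

For the timeline [None, "A", "A", None, "B"], the detour is available in 3 of the 4 steps after discovery, so the longevity is 0.75, and two relays are used, so the variability is 2.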

Table 2 summarizes longevity and variability for all 
(source, destination) pairs for which PeerWise offers 
latency reduction. For All-dest, more than half of the pairs 
have a longevity higher than 0.9, and 67% of the pairs use 
only one relay. When fewer destinations are selected at 
random or using a Zipf distribution, the number of de- 
tours, their longevity, and variability are reduced. How- 
ever, about half of the (source, destination) pairs still 
have longevity higher than 0.5 and variability of 1. 


7.3 Using Detours 


We show how the detours discovered by PeerWise trans- 
late in real life. Can user-level applications benefit from 
the network-level detours of PeerWise? From each Pla- 
netLab node running PeerWise, we download the front 
page of each of the 500 popular websites to which a 
mutually-advantageous detour exists. We use wget to 
perform two transfers every time it is called: one using 
the direct path and one using the PeerWise detour. To 
make the web request follow the detour path, we install 
the tinyproxy HTTP proxy on every PlanetLab node that 
can be used as a relay. We run each transfer 100 times, 
alternating whether detour or direct comes first, and record 
the individual completion times. 

We verify whether the detours promised by PeerWise 
are seen by the web transfers. For each (source, desti- 
nation) pair with a detour in PeerWise, we compute the 
wget reduction ratio—the ratio between the median relay 
transfer time and the median direct transfer time—and 
plot it against the PeerWise reduction ratio—the latency 
reduction ratio promised by PeerWise. Figure 13(left) 
presents the results. For 58% of the pairs, the wget reduction 
ratio is less than 1; that is, web transfers take less time through 
the relay than through the direct path, as predicted by 
PeerWise. However, many PeerWise detours do not ma- 
terialize for the wget transfers. 
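
The wget reduction ratio defined above is a ratio of medians over the repeated transfers of one (source, destination) pair. A minimal sketch, with function names of our choosing:

```python
from statistics import median


def wget_reduction_ratio(relay_times_s, direct_times_s):
    """Median relay transfer time divided by median direct transfer time."""
    return median(relay_times_s) / median(direct_times_s)


def fraction_improved(ratios):
    """Fraction of pairs whose ratio is below 1 (the relay path is faster)."""
    return sum(1 for r in ratios if r < 1.0) / len(ratios)
```

A ratio below 1 means that the detour materialized for the web transfer; a ratio above 1 means that the direct path was faster in practice.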

We explain the dissonance between the PeerWise view 
and the application view next. PeerWise detours are 
determined by network-level pings. On the other hand, the 
wget end-to-end latency includes server and proxy wait 
times and thus may be larger than network latency. 
Further, PeerWise detours are based on medians of latencies 
gathered over long periods of time. Due to potential 
latency variations, these medians may differ from the RTTs 
at the time of the transfer. 

Figure 13: (left) Wget latency reduction versus PeerWise latency reduction: 58% of all PeerWise detours achieve 
latency reduction in real life. (right) Distributions of average server wait times, relay times, and difference between 
wget and PeerWise RTTs for all detour transfers. Relay times inflate application latencies the most. 

To quantify the factors that inflate the application la- 
tency, we instrument our experiment as follows. During 
the web transfers, we run tcpdump on every relay node 
and log all proxy traffic. Using the packet timestamps, 
we compute, for each detour transfer, the network latency 
(from the TCP connection setup), the time spent at the re- 
lay and the time waiting for the server. Figure 13(right) 
shows the distributions of average server time, relay time 
and of the difference between network latency at transfer 
time and latency promised by PeerWise. The time spent 
at the relay and at the server accounts for most of the in- 
flation in application latency: half of the relays induce 
an additional average latency of at least 50 ms. PeerWise 
predicts the network part of the wget transfer time well. 
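
One plausible way to split a detour transfer into these three components from packet timestamps captured at the relay is sketched below. The text does not give the exact procedure, so the event names and the decomposition itself are assumptions made for illustration. All timestamps are in milliseconds.

```python
from dataclasses import dataclass


@dataclass
class RelayTrace:
    """Timestamps (ms) seen at the relay for one proxied transfer (hypothetical)."""
    client_synack_out: float  # relay sends SYN-ACK to the client
    client_ack_in: float      # client's handshake ACK arrives
    request_in: float         # HTTP request arrives from the client
    server_syn_out: float     # relay sends SYN to the server
    server_synack_in: float   # server's SYN-ACK arrives
    request_out: float        # relay forwards the request to the server
    response_in: float        # first response packet arrives from the server
    response_out: float       # relay forwards the response to the client


def decompose(t):
    """Return (network latency, time spent at relay, server wait time)."""
    rtt_client = t.client_ack_in - t.client_synack_out
    rtt_server = t.server_synack_in - t.server_syn_out
    network = rtt_client + rtt_server
    # Intervals during which the relay itself holds the data.
    relay = ((t.server_syn_out - t.request_in)
             + (t.request_out - t.server_synack_in)
             + (t.response_out - t.response_in))
    # Time to first response packet, minus one round trip to the server.
    server_wait = max(0.0, (t.response_in - t.request_out) - rtt_server)
    return network, relay, server_wait
```

The network component is the one that can be compared with the latency promised by PeerWise; the other two are overheads that pings do not observe.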

All relays are PlanetLab nodes; PlanetLab does not al- 
ways reflect the realities of the Internet. We believe that 
the slowness of PlanetLab is the main factor that con- 
tributes to the unusually high relay time for our transfers. 
To confirm this, we set up tinyproxy on a computer with 
minimal load, located at the University of Maryland, and run web 
transfers through it. The average relay time for all 
transfers through the UMD proxy is 5 ms, lower than that of 95% of 
all PlanetLab relays. If we consider the hypothetical 
situation in which all PlanetLab relay times were replaced 
by the average UMD relay time—effectively minimizing 
the time spent by a transfer at the relay node—then 78% 
of our web transfers would see the detours promised by 
PeerWise. We conclude that PeerWise has the potential 
to improve application performance. 
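
The hypothetical substitution can be sketched as follows. Each transfer is represented by a hypothetical tuple of direct transfer time, detour transfer time, and measured relay time, all in milliseconds; the exact accounting used for the 78% figure is not described in the text, so this is an illustration of the idea only.

```python
def fraction_with_reduction(transfers, substitute_relay_ms=None):
    """Fraction of transfers for which the detour beats the direct path.

    transfers -- iterable of (direct_ms, detour_ms, relay_ms) tuples
    substitute_relay_ms -- if given, replaces each measured relay time
    """
    transfers = list(transfers)
    wins = 0
    for direct_ms, detour_ms, relay_ms in transfers:
        if substitute_relay_ms is not None:
            # Swap the measured relay time for the substitute value.
            detour_ms = detour_ms - relay_ms + substitute_relay_ms
        if detour_ms < direct_ms:
            wins += 1
    return wins / len(transfers)
```

Calling the function once without a substitute and once with substitute_relay_ms=5 contrasts the measured outcome with the one expected from lightly loaded relays.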


8 Discussion 


We discuss some of the implications that wide adoption 
of PeerWise would have for both ISPs and users. 


8.1 Implications for ISPs 


Overlay networks violate routing policies. How then 
would inter-domain routing policy and traffic engineer- 
ing practices coexist with widespread PeerWise deploy- 
ment? Routing overlay networks enable rule violations: 
customers and peers provide transit, and selfish rout- 
ing [25] can subvert traffic engineering decisions. We 
discuss each in turn. 

Customers provide transit, which is forbidden in 
inter-domain routing [9]. Even when a detour AS path 
precisely matches the direct one (because an overlay node lies 
within the address space of one of the autonomous 
systems of the path), the overlay node is still a customer and 
a customer still provides transit. Whether that customer 
has an autonomous system or instead pays a monthly fee 
for a residential connection hardly matters. 


Overlay networks bypass traffic engineering deci- 
sions. It is unclear to what extent the excessively long 
latency paths are deliberately chosen by network admin- 
istrators. One might worry that a successful deploy- 
ment of PeerWise would hamper ISP efforts to shape 
traffic toward slower, but less utilized, links. Peer- 
Wise is not intended for high-bandwidth transfers. Its 
structure discourages bandwidth consumption: we in- 
tend to shave packet transmission latency and, because 
each pair of nodes must strive to maintain the fair- 
ness of the application-level SLA that connects them, 
they may not consume unnecessarily. Downloading a 
large file through PeerWise may not reduce the down- 
load time significantly, considering the many other bot- 
tlenecks in the network (loss, client-side queuing, server 
load, etc.) [21, 4, 22]. 


8.2 Implications for Users 


Forwarding traffic through and on behalf of others raises 
issues of privacy and liability for PeerWise users. 
Although unencrypted traffic is “public” regardless of the 
path it takes, it is reasonable to assume that users would 
be more reluctant to forward their traffic through other 
users than directly through the “faceless” ISPs. Another 
concern is being held liable for forwarding potentially il- 
licit traffic on behalf of another user. While some such 
traffic may be straightforward to filter (and negotiate in 
a PeerWise SLA), say by mechanisms similar to parental 
controls, such an approach requires knowing 
“questionable” destinations ahead of time, and leads to both false 
negatives and positives. A more general mechanism for 
non-repudiation—a means of verifiably proving to au- 
thorities the source of forwarded traffic—may be more 
appropriate, but is beyond the scope of this paper. 

A potential extension of PeerWise would be to limit 
one’s neighbors to a set of trusted users, determined for 
example via friend-of-friend links in an online social net- 
work, similar to the f2f file store [16]. While such an 
extension may obviate the concerns of non-repudiation, 
it may exacerbate privacy concerns; users may be less 
inclined to forward private traffic through their friends. 

Interestingly, PeerWise can assist in securing an end- 
user’s traffic. Reis et al. demonstrated that some ISPs 
modify users’ web pages in transit [26]. PeerWise could 
assist in routing around such ISPs, or perhaps in lending 
greater credence to a page’s authenticity. 


9 Conclusions 


PeerWise is based on building overlay networks from 
mutually advantageous peerings; we show that such a 
simple, locally enforced mechanism is sufficient to 
provide detour routes in the Internet. Surprisingly, pairs of 
nodes can help each other: few nodes are so well po- 
sitioned that they need no help, and few are so poorly 
positioned that they can help no one. Our evaluation 
of PeerWise on two sets of real world latencies and on 
PlanetLab shows that most nodes can find good detours, 
reducing latency by at least 10 ms and 10%. PeerWise 
finds detours to both regionally and globally popular des- 
tinations, as well as to websites that use other latency- 
reduction techniques such as mirroring or DNS redirec- 
tion. Most detours are long-lived and stable and reflect 
well the performance of applications using them. 


Acknowledgments 


We are grateful to our shepherd, Venugopalan Ramasubrama- 
nian, and to the NSDI reviewers for their help in improving 
this paper. We also thank Peter Druschel, Bo Han, Jay Lorch, 
Harsha Madhyastha, Justin McCann, Larry Michele, Alan Mis- 
love, Vivek Pai, and Angie Wu for their comments. This work 
was supported by NSF grants CNS-0435065, CNS-0643443 
and CNS-0626629, and by a Microsoft Live Labs fellowship. 


References 
[1] Alexa. http://www.alexa.com/. 

[2] D. G. Andersen, H. Balakrishnan, M. F. Kaashoek, and 
R. Morris. Resilient overlay networks. In SOSP, 2001. 

[3] A. Bharambe, J. R. Douceur, J. R. Lorch, T. Moscibroda, 
J. Pang, S. Seshan, and X. Zhuang. Donnybrook: Enabling 
large-scale, high-speed, peer-to-peer games. In SIGCOMM, 2008. 

[4] N. Cardwell, S. Savage, and T. Anderson. Modeling TCP 
latency. In IEEE Infocom, 2000. 

[5] B. Cohen. Incentives build robustness in BitTorrent. In 
P2PEcon, 2003. 

[6] J. Corbo and D. Parkes. The price of selfish behavior in 
bilateral network formation. In PODC, 2005. 

[7] L. Cox and B. Noble. Samsara: Honor among thieves in 
peer-to-peer storage. In SOSP, 2003. 

[8] F. Dabek, R. Cox, F. Kaashoek, and R. Morris. Vivaldi: a 
decentralized network coordinate system. In SIGCOMM, 2004. 

[9] L. Gao. On inferring autonomous system relationships 
in the Internet. IEEE/ACM Transactions on Networking, 
9(6):733-745, 2001. 

[10] K. Gummadi, S. Saroiu, and S. Gribble. King: Estimating 
latency between arbitrary Internet end hosts. In IMW, 2002. 

[11] K. P. Gummadi, H. Madhyastha, S. D. Gribble, H. M. 
Levy, and D. J. Wetherall. Improving the reliability of 
internet paths with one-hop source routing. In OSDI, 2004. 

[12] J. Ledlie, P. Gardner, and M. Seltzer. Network coordinates 
in the wild. In NSDI, 2007. 

[13] J. Ledlie, M. Seltzer, and P. Pietzuch. Proxy network 
coordinates. Tech. rep., Imperial College London, 2008. 

[14] D. Levin, R. Baden, C. Lumezanu, N. Spring, and 
B. Bhattacharjee. Motivating participation in Internet 
routing overlays. In NetEcon, 2008. 

[15] D. Levin, R. Sherwood, and B. Bhattacharjee. Fair file 
swarming with FOX. In IPTPS, 2006. 

[16] J. Li and F. Dabek. F2F: reliable storage in open 
networks. In IPTPS, 2006. 

[17] C. Lumezanu, D. Levin, and N. Spring. PeerWise discovery 
and negotiation of faster paths. In HotNets, 2007. 

[18] A. Nakao and L. Peterson. Scalable routing overlay 
networks. In ACM SIGOPS Operating Systems Review, 2006. 

[19] A. Nakao, L. Peterson, and A. Bavier. A routing underlay 
for overlay networks. In SIGCOMM, 2003. 

[20] T. S. E. Ng and H. Zhang. Predicting Internet network 
distance with coordinates-based approaches. In INFOCOM, 2002. 

[21] J. Padhye, V. Firoiu, D. Towsley, and J. Kurose. Modeling 
TCP throughput: A simple model and its empirical 
validation. In SIGCOMM, 1998. 

[22] J. Padhye and S. Floyd. Identifying the TCP behavior of 
web servers. In SIGCOMM, 2001. 

[23] M. Piatek, T. Isdal, T. Anderson, A. Krishnamurthy, and 
A. Venkataramani. Do incentives build robustness in 
BitTorrent? In NSDI, 2007. 

[24] Pyxida. http://pyxida.sourceforge.net/. 

[25] L. Qiu, Y. R. Yang, Y. Zhang, and S. Shenker. On selfish 
routing in Internet-like environments. In SIGCOMM, 2003. 

[26] C. Reis, S. D. Gribble, T. Kohno, and N. C. Weaver. 
Detecting in-flight page changes with web tripwires. In 
NSDI, 2008. 

[27] S. Saroiu, K. P. Gummadi, R. J. Dunn, S. D. Gribble, and 
H. M. Levy. An analysis of Internet content delivery 
systems. In OSDI, 2002. 

[28] S. Savage, T. Anderson, A. Aggarwal, D. Becker, 
N. Cardwell, A. Collins, E. Hoffman, J. Snell, A. Vahdat, 
G. Voelker, and J. Zahorjan. Detour: A case for informed 
Internet routing and transport. IEEE Micro, 19(1):50-59, 1999. 

[29] S. Savage, N. Cardwell, D. Wetherall, and T. Anderson. 
TCP congestion control with a misbehaving receiver. 
ACM CCR, 29(5):71-78, 1999. 

[30] M. Sirivianos, J. H. Park, X. Yang, and S. Jarecki. 
Dandelion: Cooperative content distribution with robust 
incentives. In USENIX, 2007. 

[31] L. Subramanian, I. Stoica, H. Balakrishnan, and R. Katz. 
OverQoS: An overlay based architecture for enhancing 
Internet QoS. In NSDI, 2004. 



